Preface to the Second Edition


Since the publication in 1983 of Theory of Point Estimation, much new work 
has made it desirable to bring out a second edition. The inclusion of the new 
material has increased the length of the book from 500 to 600 pages; of the 
approximately 1000 references about 25% have appeared since 1983. 

The greatest change has been the addition to the sparse treatment of Bayesian 
inference in the first edition. This includes the addition of new sections on 
Equivariant, Hierarchical, and Empirical Bayes, and on their comparisons. Other 
major additions deal with new developments concerning the information
inequality and simultaneous and shrinkage estimation. The Notes at the end of
each chapter now provide not only bibliographic and historical material but also
introductions to recent developments in point estimation and other related topics
which, for space reasons, it was not possible to include in the main text. The
problem sections also have been greatly expanded. On the other hand, to save
space most of the discussion in the first edition on robust estimation (in
particular L, M, and R estimators) has been deleted. This topic is the subject of two
excellent books by Hampel et al. (1986) and Staudte and Sheather (1990). Other
than subject matter changes, there have been some minor modifications in the
presentation. For example, all of the references are now collected together at
the end of the text, examples are listed in a Table of Examples, and equations
are referenced by section and number within a chapter and by chapter, section,
and number between chapters.

The level of presentation remains the same as that of TPE. Students with a 
thorough course in theoretical statistics (from texts such as Bickel and Doksum 
1977 or Casella and Berger 1990) would be well prepared. The second edition of 
TPE is a companion volume to “Testing Statistical Hypotheses, Second Edition 
(TSH2).” Between them, they provide an account of classical statistics from a 
unified point of view. 

Many people contributed to TPE2 with advice, suggestions, proofreading and 
problem-solving. We are grateful to the efforts of John Kimmel for overseeing 
this project; to Matt Briggs, Lynn Eberly, Rich Levine and Sam Wu for
proofreading and problem solving; to Larry Brown, Anirban DasGupta, Persi
Diaconis, Tom DiCiccio, Roger Farrell, Leslaw Gajek, Jim Hobert, Chuck
McCulloch, Elias Moreno, Christian Robert, Andrew Rukhin, Bill Strawderman
and Larry Wasserman for discussions and advice on countless topics; and to
June Meyermann for transcribing most of TPE to LaTeX. Lastly, we thank Andy
Scherrer for repairing the near-fatal hard disk crash and Marty Wells for the 
almost infinite number of times he provided us with needed references. 

E. L. Lehmann 
Berkeley, California 

George Casella 
Ithaca, New York 

March 1998 



Preface to the First Edition 


This book is concerned with point estimation in Euclidean sample spaces. 
The first four chapters deal with exact (small-sample) theory, and their approach 
and organization parallel those of the companion volume, Testing Statistical
Hypotheses (TSH). Optimal estimators are derived according to criteria such as 
unbiasedness, equivariance, and minimaxity, and the material is organized 
around these criteria. The principal applications are to exponential and group 
families, and the systematic discussion of the rich body of (relatively simple) 
statistical problems that fall under these headings constitutes a second major 
theme of the book. 

A theory of much wider applicability is obtained by adopting a large-sample
approach. The last two chapters are therefore devoted to large-sample theory,
with Chapter 5 providing a fairly elementary introduction to asymptotic
concepts and tools. Chapter 6 establishes the asymptotic efficiency, in sufficiently
regular cases, of maximum likelihood and related estimators, and of Bayes
estimators, and presents a brief introduction to the local asymptotic optimality
theory of Hajek and LeCam. Even in these two chapters, however, attention is
restricted to Euclidean sample spaces, so that estimation in sequential analysis, 
stochastic processes, and function spaces, in particular, is not covered. 

The text is supplemented by numerous problems. These and references to the 
literature are collected at the end of each chapter. The literature, particularly 
when applications are included, is so enormous and spread over the journals of 
so many countries and so many specialties that complete coverage did not seem 
feasible. The result is a somewhat inconsistent coverage which, in part, reflects 
my personal interests and experience. 

It is assumed throughout that the reader has a good knowledge of calculus 
and linear algebra. Most of the book can be read without more advanced
mathematics (including the sketch of measure theory which is presented in Section
1.2 for the sake of completeness) if the following conventions are accepted. 

1. A central concept is that of an integral such as $\int f \, dP$ or $\int f \, d\mu$. This covers
both the discrete and continuous case. In the discrete case $\int f \, dP$ becomes
$\sum f(x_i) P(x_i)$ where $P(x_i) = P(X = x_i)$ and $\int f \, d\mu$ becomes $\sum f(x_i)$. In the continuous case,
$\int f \, dP$ and $\int f \, d\mu$ become, respectively, $\int f(x) p(x) \, dx$ and $\int f(x) \, dx$. Little is lost
(except a unified notation and some generality) by always making these
substitutions.

2. When specifying a probability distribution, $P$, it is necessary to specify not
only the sample space $\mathcal{X}$, but also the class $\mathcal{A}$ of sets over which $P$ is to be
defined. In nearly all examples $\mathcal{X}$ will be a Euclidean space and $\mathcal{A}$ a large class
of sets, the so-called Borel sets, which in particular includes all open and closed
sets. The references to $\mathcal{A}$ can be ignored with practically no loss in the
understanding of the statistical aspects.

A forerunner of this book appeared in 1950 in the form of mimeographed 
lecture notes taken by Colin Blyth during a course I taught at Berkeley; they 
subsequently provided a text for the course until the stencils gave out. Some 
sections were later updated by Michael Stuart and Fritz Scholz. Throughout the 
process of converting this material into a book, I greatly benefited from the 
support and advice of my wife, Juliet Shaffer. Parts of the manuscript were read 
by Rudy Beran, Peter Bickel, Colin Blyth, Larry Brown, Fritz Scholz, and Geoff 
Watson, all of whom suggested many improvements. Sections 6.7 and 6.8 are 
based on material provided by Peter Bickel and Chuck Stone, respectively. Very 
special thanks are due to Wei-Yin Loh, who carefully read the complete
manuscript at its various stages and checked all the problems. His work led to the
correction of innumerable errors and to many other improvements. Finally, I
should like to thank Ruth Suzuki for her typing, which by now is legendary,
and Sheila Gerber for her expert typing of many last-minute additions and
corrections.


E.L. Lehmann 
Berkeley, California


March 1983 
Preparations


1 The Problem 

Statistics is concerned with the collection of data and with their analysis and 
interpretation. We shall not consider the problem of data collection in this book 
but shall take the data as given and ask what they have to tell us. The answer 
depends not only on the data, on what is being observed, but also on background 
knowledge of the situation; the latter is formalized in the assumptions with which 
the analysis is entered. There have, typically, been three principal lines of approach: 

Data analysis. Here, the data are analyzed on their own terms, essentially without 
extraneous assumptions. The principal aim is the organization and summarization 
of the data in ways that bring out their main features and clarify their underlying 
structure. 

Classical inference and decision theory. The observations are now postulated
to be the values taken on by random variables which are assumed to follow a
joint probability distribution, $P$, belonging to some known class $\mathcal{P}$. Frequently,
the distributions are indexed by a parameter, say $\theta$ (not necessarily real-valued),
taking values in a set, $\Omega$, so that

(1.1) $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$.

The aim of the analysis is then to specify a plausible value for $\theta$ (this is the
problem of point estimation), or at least to determine a subset of $\Omega$ of which we
can plausibly assert that it does, or does not, contain $\theta$ (estimation by confidence
sets or hypothesis testing). Such a statement about $\theta$ can be viewed as a summary
of the information provided by the data and may be used as a guide to action.

Bayesian analysis. In this approach, it is assumed in addition that $\theta$ is itself
a random variable (though unobservable) with a known distribution. This prior
distribution (specified according to the problem) is modified in light of the data to
determine a posterior distribution (the conditional distribution of $\theta$ given the data),
which summarizes what can be said about $\theta$ on the basis of the assumptions made
and the data.

These three methods of approach permit increasingly strong conclusions, but 
they do so at the price of assumptions which are correspondingly more detailed 
and possibly less reliable. It is often desirable to use different formulations in 
conjunction; for example, by planning a study (e.g., determining sample size) 
under rather detailed assumptions but performing the analysis under a weaker set 
which appears more trustworthy. In practice, it is often useful to model a problem
in a number of different ways. One may then be satisfied if there is reasonable
agreement among the conclusions; in the contrary case, a closer examination of 
the different sets of assumptions will be indicated. 

In this book, Chapters 2, 3, and 5 will be primarily concerned with the second
formulation, Chapter 4 with the third. Chapter 6 considers a large-sample
treatment of both. (A book-length treatment of the first formulation is Tukey’s classic
Exploratory Data Analysis, or the more recent book by Hoaglin, Mosteller, and
Tukey 1985, which includes the interesting approach of Diaconis 1985.)
Throughout the book we shall try to specify what is meant by a “best” statistical procedure
for a given problem and to develop methods for determining such procedures. 
Ideally, this would involve a formal decision-theoretic evaluation of the problem 
resulting in an optimal procedure. 

Unfortunately, there are difficulties with this approach, partially caused by the 
fact that there is no unique, convincing definition of optimality. Compounding this 
lack of consensus about optimality criteria is that there is also no consensus about 
the evaluation of such criteria. For example, even if it is agreed that squared error 
loss is a reasonable criterion, the method of evaluation, be it Bayesian, frequentist 
(the classical approach of averaging over repeated experiments), or conditional, 
must then be agreed upon. 

Perhaps even more serious is the fact that the optimal procedure and its
properties may depend very heavily on the precise nature of the assumed probability
model (1.1), which often rests on rather flimsy foundations. It therefore becomes 
important to consider the robustness of the proposed solution under deviations 
from the model. Some aspects of robustness, from both Bayesian and frequentist 
perspectives, will be taken up in Chapters 4 and 5. 

The discussion so far has been quite general; let us now specialize to point 
estimation. In terms of the model (1.1), suppose that $g$ is a real-valued function
defined over $\Omega$ and that we would like to know the value of $g(\theta)$ (which may, of
course, be $\theta$ itself). Unfortunately, $\theta$, and hence $g(\theta)$, is unknown. However, the
data can be used to obtain an estimate of $g(\theta)$, a value that one hopes will be close
to $g(\theta)$.

Point estimation is one of the most common forms of statistical inference. One 
measures a physical quantity in order to estimate its value; surveys are conducted 
to estimate the proportion of voters favoring a candidate or viewers watching a 
television program; agricultural experiments are carried out to estimate the effect of 
a new fertilizer, and clinical experiments to estimate the improved life expectancy 
or cure rate resulting from a medical treatment. As a prototype of such an estimation
problem, consider the determination of an unknown quantity by measuring it. 

Example 1.1 The measurement problem. A number of measurements are taken
of some quantity, for example, a distance (or temperature), in order to obtain an
estimate of the quantity $\theta$ being measured. If the $n$ measured values are $x_1, \ldots, x_n$,
a common recommendation is to estimate $\theta$ by their mean

$\bar{x} = \frac{x_1 + \cdots + x_n}{n}$.

The idea of averaging a number of observations to obtain a more precise value
seems so commonplace today that it is difficult to realize it has not always been
in use. It appears to have been introduced only toward the end of the seventeenth 
century (see Plackett, 1958). But why should the observations be combined in just 
this way? The following are two properties of the mean, which were used in early 
attempts to justify this procedure. 

(i) An appealing approximation to the true value being measured is the value $a$,
for which the sum of squared differences $\sum (x_i - a)^2$ is a minimum. That this
least squares estimate of $\theta$ is $\bar{x}$ is seen from the identity

(1.2) $\sum (x_i - a)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - a)^2$,

since the first term on the right side does not involve $a$ and the second term
is minimized by $a = \bar{x}$. (For the history of least squares, see Eisenhart 1964,
Plackett 1972, Harter 1974-1976, and Stigler 1981. Least squares estimation 
will be discussed in a more general setting in §3.4.) 

(ii) The least squares estimate defined in (i) is the value minimizing the sum of the
squared residuals, the residuals being the differences between the observations
$x_i$ and the estimated value. Another approach is to ask for the value $a$ for which
the sum of the residuals is zero, so that the positive and negative residuals are
in balance. The condition on $a$ is

(1.3) $\sum (x_i - a) = 0$,

and this again immediately leads to $a = \bar{x}$. (That the two conditions lead to the
same answer is, of course, obvious since (1.3) expresses that the derivative of
(1.2) with respect to $a$ is zero.)
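
Properties (i) and (ii) are easy to check numerically. The following is a minimal sketch, not part of the original argument; the measurements `xs` and the helper names `sum_sq` and `decomposition` are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def sum_sq(xs, a):
    """Left side of (1.2): sum of squared differences from a candidate value a."""
    return sum((x - a) ** 2 for x in xs)


def decomposition(xs, a):
    """Right side of (1.2): sum (x_i - xbar)^2 + n (xbar - a)^2."""
    xbar = fmean(xs)
    return sum_sq(xs, xbar) + len(xs) * (xbar - a) ** 2


# Hypothetical measurements, chosen only for illustration.
xs = [9.8, 10.1, 10.4, 9.9, 10.3]
xbar = fmean(xs)

# Identity (1.2) holds for every candidate a, up to floating-point error.
for a in (8.0, 10.0, xbar, 12.5):
    assert abs(sum_sq(xs, a) - decomposition(xs, a)) < 1e-9

# Condition (1.3): the residuals about the mean sum to zero.
residual_sum = sum(x - xbar for x in xs)
print(f"mean = {xbar:.3f}, sum of residuals = {residual_sum:.2e}")
```

Because the first term of the decomposition does not depend on $a$, moving $a$ away from the mean in either direction can only increase the sum of squares.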

These two principles clearly belong to the first (data analytic) level mentioned 
at the beginning of the section. They derive the mean as a reasonable descriptive 
measure of the center of the observations, but they cannot justify $\bar{x}$ as an estimate
of the true value $\theta$ since no explicit assumption has been made connecting the
observations $x_i$ with $\theta$. To establish such a connection, let us now assume that
the $x_i$ are the observed values of $n$ independent random variables which have a
common distribution depending on $\theta$. Eisenhart (1964) attributes the crucial step
of introducing such probability models for this purpose to Simpson (1755). 

More specifically, we shall assume that $X_i = \theta + U_i$, where the measurement
error $U_i$ is distributed according to a distribution $F$ symmetric about 0 so that the
$X_i$ are symmetrically distributed about $\theta$ with distribution

(1.4) $P(X_i \le x) = F(x - \theta)$.

In terms of this model, can we now justify the idea that the mean provides a more 
precise value than a single observation? The second of the approaches mentioned 
at the beginning of the section (classical inference) suggests the following kind of 
consideration. 

If the $X$'s are independent and have a finite variance $\sigma^2$, the variance of the
mean $\bar{X}$ is $\sigma^2/n$; the expected squared difference between $\bar{X}$ and $\theta$ is therefore
only $1/n$ of what it is for a single observation. However, if the $X$'s have a Cauchy
distribution, the distribution of $\bar{X}$ is the same as that of a single $X_i$ (Problem 1.6),
so that nothing is gained by taking several measurements and then averaging them.
Whether $\bar{X}$ is a reasonable estimator of $\theta$ thus depends on the nature of the $X_i$. ||
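
The contrast between the two error distributions can be seen in a small simulation. This is an illustrative sketch only: the seed, the sample size `n`, the number of replications `reps`, and the use of the interquartile range as a measure of spread (the Cauchy distribution has no variance) are all assumptions made here, not part of the text.

```python
import math
import random
from statistics import fmean, quantiles


def iqr(values):
    """Interquartile range, a spread measure that is defined even for the Cauchy."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)  # theta = 0, sigma = 1


def cauchy_draw(rng):
    # Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2)).
    return math.tan(math.pi * (rng.random() - 0.5))


def simulate_means(draw, n, reps, rng):
    """Return `reps` sample means, each based on n independent draws."""
    return [fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


rng = random.Random(12345)  # fixed, arbitrary seed
n, reps = 25, 4000

results = {}
for name, draw in (("normal", normal_draw), ("Cauchy", cauchy_draw)):
    singles = [draw(rng) for _ in range(reps)]
    means = simulate_means(draw, n, reps, rng)
    results[name] = (iqr(singles), iqr(means))
    print(f"{name}: IQR single = {results[name][0]:.3f}, "
          f"IQR of mean = {results[name][1]:.3f}")
```

For normal errors the spread of the mean should shrink by roughly a factor $\sqrt{n}$, whereas for Cauchy errors the two spreads should be about equal, in line with Problem 1.6.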


This example suggests that the formalization of an estimation problem involves 
two basic ingredients: 

(a) A real-valued function $g$ defined over a parameter space $\Omega$, whose value at $\theta$
is to be estimated; we shall call $g(\theta)$ the estimand. [In Example 1.1, $g(\theta) = \theta$.]

(b) A random observable $X$ (typically vector-valued) taking on values in a sample
space $\mathcal{X}$ according to a distribution $P_\theta$, which is known to belong to a family
$\mathcal{P}$ as stated in (1.1). [In Example 1.1, $X = (X_1, \ldots, X_n)$, where the $X_i$ are
independently, identically distributed (iid) and their distribution is given by
(1.4). The observed value $x$ of $X$ constitutes the data.]

The problem is the determination of a suitable estimator. 

Definition 1.2 An estimator is a real-valued function $\delta$ defined over the sample
space. It is used to estimate an estimand, $g(\theta)$, a real-valued function of the
parameter.

Of course, it is hoped that $\delta(X)$ will tend to be close to the unknown $g(\theta)$, but
such a requirement is not part of the formal definition of an estimator. The value
$\delta(x)$ taken on by $\delta(X)$ for the observed value $x$ of $X$ is the estimate of $g(\theta)$, which
will be our “educated guess” for the unknown value.

One could adopt a slightly more restrictive definition than Definition 1.2. In
applications, it is often desirable to restrict $\delta$ to possible values of $g(\theta)$, for example,
to be positive when $g$ takes on only positive values, to be integer-valued when $g$
is, and so on. For the moment, however, it is more convenient not to impose this
additional restriction.

The estimator $\delta$ is to be close to $g(\theta)$, and since $\delta(X)$ is a random variable,
we shall interpret this to mean that it will be close on the average. To make this
requirement precise, it is necessary to specify a measure of the average closeness
of (or distance from) an estimator to $g(\theta)$. Examples of such measures are

(1.5) $P(|\delta(X) - g(\theta)| < c)$ for some $c > 0$

and

(1.6) $E|\delta(X) - g(\theta)|^p$ for some $p > 0$.

(Of these, we want the first to be large and the second to be small.) If $g$ and $\delta$ take
on only positive values, one may be interested in

$E \left| \frac{\delta(X)}{g(\theta)} - 1 \right|^p$,

which suggests generalizing (1.6) to

(1.7) $K(\theta) E|\delta(X) - g(\theta)|^p$.


Quite generally, suppose that the consequences of estimating $g(\theta)$ by a value $d$
are measured by $L(\theta, d)$. Of the loss function $L$, we shall assume that

(1.8) $L(\theta, d) \ge 0$ for all $\theta, d$

and

(1.9) $L[\theta, g(\theta)] = 0$ for all $\theta$,

so that the loss is zero when the correct value is estimated. The accuracy, or rather
inaccuracy, of an estimator $\delta$ is then measured by the risk function

(1.10) $R(\theta, \delta) = E_\theta\{L[\theta, \delta(X)]\}$,

the long-term average loss resulting from the use of $\delta$. One would like to find a $\delta$
which minimizes the risk for all values of $\theta$.

As stated, this problem has no solution. For, by (1.9), it is possible to reduce the
risk at any given point $\theta_0$ to zero by making $\delta(x)$ equal to $g(\theta_0)$ for all $x$. There
thus exists no uniformly best estimator, that is, no estimator which simultaneously
minimizes the risk for all values of $\theta$, except in the trivial case that $g(\theta)$ is constant.
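
The crossing of risk functions can be made concrete in the setting of Example 1.1. The sketch below assumes squared error loss, $g(\theta) = \theta$, and observations with mean $\theta$ and finite variance $\sigma^2$; the values of `sigma`, `n`, `theta0`, and the grid are hypothetical. Under these assumptions the risk of the sample mean is $\sigma^2/n$, and the risk of the constant estimator $\delta(x) = \theta_0$ is $(\theta - \theta_0)^2$.

```python
def risk_sample_mean(theta, sigma=1.0, n=10):
    """Squared-error risk of the sample mean: Var(Xbar) = sigma^2 / n, free of theta."""
    return sigma ** 2 / n


def risk_constant(theta, theta0=0.0):
    """Squared-error risk of the estimator that ignores the data and returns theta0."""
    return (theta - theta0) ** 2


# Hypothetical grid of parameter values.
grid = [-2.0, -1.0, -0.2, 0.0, 0.2, 1.0, 2.0]
table = [(t, risk_sample_mean(t), risk_constant(t)) for t in grid]

for t, r_mean, r_const in table:
    lower = "constant" if r_const < r_mean else "mean"
    print(f"theta = {t:+.1f}  R(mean) = {r_mean:.3f}  "
          f"R(constant) = {r_const:.3f}  lower risk: {lower}")
```

Near $\theta_0$ the constant estimator has the smaller risk, and far from $\theta_0$ the sample mean does, so neither risk function lies below the other everywhere.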

One way of avoiding this difficulty is to restrict the class of estimators by ruling
out estimators that too strongly favor one or more values of $\theta$ at the cost of
neglecting other possible values. This can be achieved by requiring the estimator to
satisfy some condition which enforces a certain degree of impartiality. One such
condition requires that the bias $E_\theta[\delta(X)] - g(\theta)$, sometimes called the systematic
error, of the estimator $\delta$ be zero, that is, that

(1.11) $E_\theta[\delta(X)] = g(\theta)$ for all $\theta \in \Omega$.

This condition of unbiasedness ensures that, in the long run, the amounts by which
$\delta$ over- and underestimates $g(\theta)$ will balance, so that the estimated value will be
correct “on the average.” A somewhat similar condition is obtained by considering
not the amount but only the frequency of over- and underestimation. This leads to
the condition

(1.12) $P_\theta[\delta(X) < g(\theta)] = P_\theta[\delta(X) > g(\theta)]$

or slightly more generally to the requirement that $g(\theta)$ be a median of $\delta(X)$ for all
values of $\theta$. To distinguish it from this condition of median-unbiasedness, (1.11)
is called mean-unbiasedness if there is a possibility of confusion.

Mean-unbiased estimators, due to Gauss and perhaps the most classical of all 
frequentist constructions, are treated in Chapter 2. There, we will also consider 
performance assessments that naturally arise from unbiasedness considerations. 
[A more general unbiasedness concept, of which (1.11) and (1.12) are special 
cases, will be discussed in Section 3.1.] 

A different impartiality condition can be formulated when symmetries are present 
in a problem. It is then natural to require a corresponding symmetry to hold for 
the estimator. The resulting condition of equivariance will be explored in Chapter 
3 and will also play a role in the succeeding chapters. 

In many important situations, unbiasedness and equivariance lead to
estimators that are uniformly best among the estimators satisfying these restrictions.
Nevertheless, the applicability of both conditions is limited. There is an
alternative approach which is more generally applicable. Instead of seeking an estimator
which minimizes the risk uniformly in $\theta$, one can more modestly ask that the risk
function be low only in some overall sense. Two natural global measures of the
size of the risk are the average

(1.13) $\int R(\theta, \delta) w(\theta) \, d\theta$

for some weight function $w$ and the maximum of the risk function

(1.14) $\sup_{\Omega} R(\theta, \delta)$.

The estimator minimizing (1.13) (discussed in Chapter 4) formally coincides with
the Bayes estimator when $\theta$ is assumed to be a random variable with
probability density $w$. Minimizing (1.14) leads to the minimax estimator, which will be
considered in Chapter 5.
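
The two global measures can rank estimators differently. The following sketch illustrates this in a binomial model, which is an example chosen here and not taken from this section: $X$ is binomial with `n_trials` trials and success probability $p$, the loss is squared error, and the weight function in (1.13) is taken to be $w = 1$ on $(0, 1)$. The estimator `shrink` is the classical constant-risk estimator $(X + \sqrt{n}/2)/(n + \sqrt{n})$; its use here, like all function names, is an assumption made for illustration.

```python
import math


def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def risk(estimator, n, p):
    """Exact squared-error risk E_p[(delta(X) - p)^2] for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) * (estimator(k, n) - p) ** 2
               for k in range(n + 1))


def mle(k, n):
    return k / n


def shrink(k, n):
    # Constant-risk estimator for the binomial (an outside fact; see the lead-in).
    r = math.sqrt(n)
    return (k + r / 2) / (n + r)


n_trials = 100
grid = [(i + 0.5) / 200 for i in range(200)]  # midpoints of (0, 1)


def average_risk(estimator):
    # Midpoint-rule approximation to (1.13) with w = 1 on (0, 1).
    return sum(risk(estimator, n_trials, p) for p in grid) / len(grid)


def max_risk(estimator):
    # Grid approximation to (1.14).
    return max(risk(estimator, n_trials, p) for p in grid)


summary = {f.__name__: (average_risk(f), max_risk(f)) for f in (mle, shrink)}
for name, (avg, mx) in summary.items():
    print(f"{name}: average risk = {avg:.6f}, maximum risk = {mx:.6f}")
```

With these choices the usual estimator $X/n$ has the smaller average risk while `shrink` has the smaller maximum risk, so the preferred estimator depends on which global measure is adopted.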

The formulation of an estimation problem in a concrete situation along the lines
described in this chapter requires specification of the probability model (1.1) and of
a measure of inaccuracy $L(\theta, d)$. In the measurement problem of Example 1.1 and
its generalizations to linear models, it is frequently reasonable to assume that the
measurement errors are approximately normally distributed. In other situations,
the assumptions underlying a binomial or Poisson distribution may be
appropriate. Thus, knowledge of the circumstances and previous experience with similar
situations will often suggest a particular parametric family $\mathcal{P}$ of distributions. If
such information is not available, one may instead adopt a nonparametric model,
which requires only very general assumptions such as independence or symmetry
but does not lead to a particular parametric family of distributions. As a
compromise between these two approaches, one may be willing to assume that the true
distribution, though not exactly following a particular parametric form, lies within 
a stated distance of some parametric family. For a theory of such neighborhood 
models see, for example, Huber (1981) or TSH2, Section 9.3. 

The choice of an appropriate model requires judgment and utilizes experience; 
it is also affected by considerations of convenience. Analogous considerations 
for choice of the loss function L appear to be much more difficult. The most 
common fate of a point estimate (for example, of the distance of a star or the 
success probability of an operation) is to wind up in a research report or paper. It 
is likely to be used on different occasions and in various settings for a variety of 
purposes which cannot be foreseen at the time the estimate is made. Under these 
circumstances, one wants the estimator to be accurate, but just what measure of 
accuracy should be used is fairly arbitrary. 

This was recognized very clearly by Laplace (1820) and Gauss (1821), who 
compared the estimation of an unknown quantity, on the basis of observations 
with random errors, with a game of chance and the error in the estimated value 
with the loss resulting from such a game. Gauss proposed the square of the error 
as a measure of loss or inaccuracy. Should someone object to this specification 
as arbitrary, he writes, he is in complete agreement. He defends his choice by an 
appeal to mathematical simplicity and convenience. Among the infinite variety 
of possible functions for the purpose, the square is the simplest and is therefore 
preferable. 

When estimates are used to make definite decisions (for example, to determine 
the amount of medication to be given a patient or the size of an order that a store
should place for some goods), it is sometimes possible to specify the loss function
by the consequences of various errors in the estimate. A general discussion of the 
distinction between inference and decision problems is given by Blyth (1970) and 
Barnett (1982). 

Actually, it turns out that much of the general theory does not require a detailed
specification of the loss function but applies to large classes of such functions,
in particular to loss functions $L(\theta, d)$ which are convex in $d$. [For example, this
includes (1.7) with $p \ge 1$ but not with $p < 1$. It does not include (1.5).] We shall
develop here the theory for suitably general classes of loss functions whenever the 
cost in complexity is not too high. However, in applications to specific examples 
— and these form a large part of the subject — the choice of squared error as loss 
has the twofold advantage of ease of computation and of leading to estimators that 
can be obtained explicitly. For these reasons, in the examples we shall typically 
take the loss to be squared error. 
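
The convexity claims above can be probed numerically. The sketch below applies a midpoint test on a finite grid, which can expose a failure of convexity but cannot prove convexity; the grid, the tolerance, and the zero-one loss used as a stand-in for (1.5) (its risk is one minus the probability in (1.5)) are assumptions made here for illustration.

```python
from itertools import combinations


def power_loss(p, theta=0.0):
    """Loss |d - theta|^p, that is, (1.7) with K(theta) = 1."""
    return lambda d: abs(d - theta) ** p


def zero_one_loss(c, theta=0.0):
    """Loss equal to 0 if |d - theta| < c and 1 otherwise."""
    return lambda d: 0.0 if abs(d - theta) < c else 1.0


def midpoint_convex(loss, points, tol=1e-12):
    """Necessary condition for convexity: loss(midpoint) <= mean of endpoint losses."""
    return all(
        loss((a + b) / 2) <= (loss(a) + loss(b)) / 2 + tol
        for a, b in combinations(points, 2)
    )


points = [i / 4 for i in range(-12, 13)]  # grid on [-3, 3]
checks = {
    "p=2": midpoint_convex(power_loss(2.0), points),
    "p=1": midpoint_convex(power_loss(1.0), points),
    "p=0.5": midpoint_convex(power_loss(0.5), points),
    "zero-one, c=1": midpoint_convex(zero_one_loss(1.0), points),
}
for name, ok in checks.items():
    print(f"{name}: passes midpoint test = {ok}")
```

The power losses with $p = 2$ and $p = 1$ pass the test on this grid, while $p = 0.5$ and the zero-one loss fail it.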

Theoretical statistics builds on many different branches of mathematics, from 
set theory and algebra to analysis and probability. In this chapter, we will present 
an overview of some of the most relevant topics needed for the statistical theory 
to follow. 


2 Measure Theory and Integration 

A convenient framework for theoretical statistics is measure theory in abstract 
spaces. The present section will sketch (without proofs) some of the principal 
concepts, results, and notational conventions of this theory. Such a sketch should 
provide sufficient background for a comfortable understanding of the ideas and 
results and the essentials of most of the proofs in this book. A fuller account of 
measure theory can be found in many standard books, for example, Halmos (1950), 
Rudin (1966), Dudley (1989), and Billingsley (1995). 

The most natural example of a “measure” is that of the length, area, or volume 
of sets in one-, two-, or three-dimensional Euclidean space. As in these special 
cases, a measure assigns non-negative (not necessarily finite) values to sets in some 
space $\mathcal{X}$. A measure $\mu$ is thus a set function; the value it assigns to a set $A$ will be
denoted by $\mu(A)$.

In generalization of the properties of length, area, and volume, a measure will
be required to be additive, that is, to satisfy

(2.1) $\mu(A \cup B) = \mu(A) + \mu(B)$ when $A$, $B$ are disjoint,

where $A \cup B$ denotes the union of $A$ and $B$. From (2.1), it follows immediately by
induction that additivity extends to any finite union of disjoint sets. The measures
with which we shall be concerned will be required to satisfy the stronger condition
of sigma-additivity, namely that

(2.2) $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$

for any countable collection of disjoint sets.
The domain over which a measure $\mu$ is defined is a class of subsets of $\mathcal{X}$. It would
seem easiest to assume that this is the class of all subsets of $\mathcal{X}$. Unfortunately, it
turns out that typically it is not possible to give a satisfactory definition of the
measures of interest for all subsets of $\mathcal{X}$ in such a way that (2.2) holds. [Such
a negative statement holds in particular for length, area, and volume (see, for
example, Halmos (1950), p. 70) but not for the measure $\mu$ of Example 2.1 below.]
It is therefore necessary to restrict the definition of $\mu$ to a suitable class of subsets
of $\mathcal{X}$. This class should contain the whole space $\mathcal{X}$ as a member, and for any set $A$
also its complement $\mathcal{X} - A$. In view of (2.2), it should also contain the union of any
countable collection of sets of the class. A class of sets satisfying these conditions
is called a $\sigma$-field or $\sigma$-algebra. It is easy to see that if $A_1, A_2, \ldots$ are members of
a $\sigma$-field $\mathcal{A}$, then so are their union and intersection (Problem 2.1).
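
On a finite space the defining conditions of a $\sigma$-field can be checked by brute force, since countable unions reduce to finite ones. The sketch below builds the smallest $\sigma$-field containing a given collection of subsets; the function name `generated_sigma_field`, the four-point space, and the generating sets are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def generated_sigma_field(space, generators):
    """Smallest class containing `generators`, closed under complement and union.

    On a finite space, closure under complements and pairwise unions suffices.
    """
    space = frozenset(space)
    field = {frozenset(), space} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for a in current:
            if space - a not in field:
                field.add(space - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in field:
                field.add(a | b)
                changed = True
    return field


space = {1, 2, 3, 4}
field = generated_sigma_field(space, [{1}, {1, 2}])
for member in sorted(field, key=lambda s: (len(s), sorted(s))):
    print(sorted(member))
```

Intersections need not be added separately: as in Problem 2.1, closure under intersection follows from closure under complements and unions.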

If $\mathcal{A}$ is a $\sigma$-field of subsets of a space $\mathcal{X}$, then $(\mathcal{X}, \mathcal{A})$ is said to be a measurable
space and the sets $A$ of $\mathcal{A}$ to be measurable. A measure $\mu$ is a nonnegative set
function defined over a $\sigma$-field $\mathcal{A}$ and satisfying (2.2). If $\mu$ is a measure defined
over a measurable space $(\mathcal{X}, \mathcal{A})$, the triple $(\mathcal{X}, \mathcal{A}, \mu)$ is called a measure space.

A measure is $\sigma$-finite if there exist sets $A_i$ in $\mathcal{A}$ whose union is $\mathcal{X}$ and such
that $\mu(A_i) < \infty$. All measures with which we shall be concerned in this book are
$\sigma$-finite, and we shall therefore use the term measure to mean a $\sigma$-finite measure.

The following are two important examples of measure spaces. 

Example 2.1 Counting measure. Let $\mathcal{X}$ be countable and $\mathcal{A}$ the class of all
subsets of $\mathcal{X}$. For any $A$ in $\mathcal{A}$, let $\mu(A)$ be the number of points of $A$ if $A$ is
finite, and $\mu(A) = \infty$ otherwise. This measure $\mu$ is called counting measure. That
$\mu$ is $\sigma$-finite is obvious. ||

Example 2.2 Lebesgue measure. Let $\mathcal{X}$ be $n$-dimensional Euclidean space $E_n$,
and let $\mathcal{A}$ be the smallest $\sigma$-field containing all open rectangles

(2.3) $A = \{(x_1, \ldots, x_n) : a_i < x_i < b_i\}$, $-\infty < a_i < b_i < \infty$.

We shall then say that $(\mathcal{X}, \mathcal{A})$ is Euclidean. The members of $\mathcal{A}$ are called Borel
sets. This is a very large class which contains, among others, all open and all closed
subsets of $\mathcal{X}$. There exists a (unique) measure $\mu$, defined over $\mathcal{A}$, which assigns
to (2.3) the measure

(2.4) $\mu(A) = (b_1 - a_1) \cdots (b_n - a_n)$,

that is, its volume; $\mu$ is called Lebesgue measure. ||

The intuitive meaning of measure suggests that any subset of a set of measure
zero should again have measure zero. If $(\mathcal{X}, \mathcal{A}, \mu)$ is a measure space, it may,
however, happen that a subset of a set in $\mathcal{A}$ which has measure zero is not in
$\mathcal{A}$ and hence not measurable. This difficulty can be remedied by the process of
completion. Consider the class $\mathcal{B}$ of all sets $B = A \cup C$ where $A$ is in $\mathcal{A}$ and $C$ is
a subset of a set in $\mathcal{A}$ having measure zero. Then, $\mathcal{B}$ is a $\sigma$-field (Problem 2.7). If
$\mu'$ is defined over $\mathcal{B}$ by $\mu'(B) = \mu(A)$, $\mu'$ agrees with $\mu$ over $\mathcal{A}$, and $(\mathcal{X}, \mathcal{B}, \mu')$ is
called the completion of the measure space $(\mathcal{X}, \mathcal{A}, \mu)$.

When the process of completion is applied to Example 2.2 so that $\mathcal{X}$ is Euclidean
and $\mathcal{A}$ is the class of Borel sets, the resulting larger class $\mathcal{B}$ is the class of Lebesgue
measurable sets. The measure $\mu'$ defined over $\mathcal{B}$, which agrees with Lebesgue
measure over the Borel sets, is also called Lebesgue measure.

A third principal concept needed in addition to $\sigma$-field and measure is that of the
integral of a real-valued function $f$ with respect to a measure $\mu$. However, before
defining this integral, it is necessary to specify a suitable class of functions $f$. This
will be done in three steps.

First, consider the class of real-valued functions $s$ called simple, which take on
only a finite number of values, say $a_1, \ldots, a_m$, and for which the sets

(2.5) $A_i = \{x : s(x) = a_i\}$

belong to $\mathcal{A}$. An important special case of a simple function is the indicator $I_A$ of
a set $A$ in $\mathcal{A}$, defined by

(2.6) $I_A(x) = I(x \in A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}$

If the set $A$ is an interval, for example $(a, b]$, the indicator function of the interval
may be written in the alternate form $I(a < x \le b)$.

Second, let $s_1, s_2, \ldots$ be a nondecreasing sequence of non-negative simple functions
and let

(2.7) $f(x) = \lim_{n \to \infty} s_n(x).$

Note that this limit exists since for every $x$, the sequence $s_1(x), s_2(x), \ldots$ is
nondecreasing but that $f(x)$ may be infinite. A function with domain $\mathcal{X}$ and range
$[0, \infty)$, that is, non-negative and finite valued, will be called $\mathcal{A}$-measurable or, for
short, measurable if there exists a nondecreasing sequence of non-negative simple
functions such that (2.7) holds for all $x \in \mathcal{X}$.

Third, for an arbitrary function $f$, define its positive and negative part by

$f^+(x) = \max(f(x), 0), \quad f^-(x) = -\min(f(x), 0),$

so that $f^+$ and $f^-$ are both non-negative and

$f = f^+ - f^-.$

Then a function with domain $\mathcal{X}$ and range $(-\infty, \infty)$ will be called measurable if
both its positive and its negative parts are measurable. The measurable functions
constitute a very large class which has a simple alternative characterization.

It can be shown that a real-valued function $f$ is $\mathcal{A}$-measurable if and only if, for
every Borel set $B$ on the real line, the set

$\{x : f(x) \in B\}$

is in $\mathcal{A}$. It follows from the definition of Borel sets that it is enough to check that
$\{x : f(x) < b\}$ is in $\mathcal{A}$ for every $b$. This shows in particular that if $(\mathcal{X}, \mathcal{A})$ is
Euclidean and $f$ continuous, then $f$ is measurable. As another important class,
consider functions taking on a countable number of values. If $f$ takes on distinct
values $a_1, a_2, \ldots$ on sets $A_1, A_2, \ldots$, it is measurable if and only if $A_i \in \mathcal{A}$ for all
$i$.

The integral can now be defined in three corresponding steps.

(i) For a non-negative simple function $s$ taking on values $a_i$ on the sets $A_i$, define

(2.8) $\int s \, d\mu = \sum_i a_i \mu(A_i),$

where $a\mu(A)$ is to be taken as zero when $a = 0$ and $\mu(A) = \infty$.

(ii) For a non-negative measurable function $f$ given by (2.7), define

(2.9) $\int f \, d\mu = \lim_{n \to \infty} \int s_n \, d\mu.$

Here, the limit on the right side exists since the fact that the functions $s_n$
are nondecreasing implies the same for the numbers $\int s_n \, d\mu$. The definition
(2.9) is meaningful because it can be shown that if $\{s_n\}$ and $\{s_n'\}$ are two
nondecreasing sequences with the same limit function, their integrals also
will have the same limit. Thus, the value of $\int f \, d\mu$ is independent of the
particular sequence used in (2.7).

The definitions (2.8) and (2.9) do not preclude the possibility that $\int s \, d\mu$ or
$\int f \, d\mu$ is infinite. A non-negative measurable function is integrable (with
respect to $\mu$) if $\int f \, d\mu < \infty$.

(iii) An arbitrary measurable function $f$ is said to be integrable if its positive and
negative parts are integrable, and its integral is then defined by

(2.10) $\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu.$

Important special cases of this definition are obtained by taking, for $\mu$, the
measures defined in Examples 2.1 and 2.2.
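The three-step definition can be traced numerically. The sketch below is only an illustration under stated assumptions: it takes the particular function f(x) = x^2 on [0, 1] with Lebesgue measure, and uses the dyadic simple functions s_n (equal to k/2^n wherever k/2^n <= f < (k+1)/2^n), which are one possible nondecreasing sequence satisfying (2.7). The function name `simple_approx_integral` is hypothetical.

```python
import math

def simple_approx_integral(n):
    """Integral (2.8) of the n-th dyadic simple function s_n for f(x) = x**2 on [0, 1].

    s_n takes the value k / 2**n on A_k = {x : k/2**n <= f(x) < (k+1)/2**n}.
    Because this f is increasing, A_k is the interval
    [sqrt(k/2**n), sqrt((k+1)/2**n)), whose Lebesgue measure is its length.
    """
    m = 2 ** n
    total = 0.0
    for k in range(m):
        value = k / m                                          # a_k, the value of s_n on A_k
        measure = math.sqrt((k + 1) / m) - math.sqrt(k / m)    # mu(A_k)
        total += value * measure                               # a_k * mu(A_k)
    return total

# The integrals of s_1, s_2, ... increase toward the integral of f, as in (2.9).
integrals = [simple_approx_integral(n) for n in range(1, 13)]
```

Since f exceeds s_n by less than 2^(-n) everywhere, the n-th value differs from the limit 1/3 by less than 2^(-n).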


Example 2.3 Continuation of Example 2.1. If $\mathcal{X} = \{x_1, x_2, \ldots\}$ and $\mu$ is counting
measure, it is easily seen from (2.8) through (2.10) that

$\int f \, d\mu = \sum_i f(x_i).$
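As a small check of this example, the following sketch integrates a function that changes sign with respect to counting measure on a finite set, going through the positive and negative parts as in (2.10). The set of points, the function f, and the helper name `integral_counting` are illustrative assumptions.

```python
from fractions import Fraction

def integral_counting(f, points):
    """Integral of f with respect to counting measure on a finite set of points,
    computed as (integral of f+) minus (integral of f-), following (2.10)."""
    pos = sum(max(f(x), 0) for x in points)     # integral of the positive part f+
    neg = sum(-min(f(x), 0) for x in points)    # integral of the negative part f-
    return pos - neg

points = [1, 2, 3, 4, 5]

def f(x):
    """Illustrative function f(x) = (-1)^x / x, which takes both signs."""
    return Fraction((-1) ** x, x)

value = integral_counting(f, points)
```

The result agrees with the plain sum of the values f(x_i), which is what the example asserts.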


Example 2.4 Continuation of Example 2.2. If $\mu$ is Lebesgue measure, then
$\int f \, d\mu$ exists whenever the Riemann integral (the integral taught in calculus
courses) exists and the two agree. However, the integral defined in (2.8) through
(2.10) exists for many functions for which the Riemann integral is not defined. A
simple example is the function $f$ for which $f(x) = 1$ or $0$, as $x$ is rational or irrational.
It follows from (2.22) below that the integral of $f$ with respect to Lebesgue
measure is zero; on the other hand, $f$ is not Riemann integrable (Problem 2.11). ||
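The reasoning behind the value zero can be spelled out in one line. The sketch below assumes an arbitrary enumeration $r_1, r_2, \ldots$ of the rationals (the symbols $r_k$ and $\varepsilon$ are introduced here for illustration) and uses only monotonicity and countable additivity of Lebesgue measure $\mu$ on the real line.

```latex
\mu(\{r_k\}) \le \mu\bigl((r_k - \varepsilon,\, r_k + \varepsilon)\bigr) = 2\varepsilon
\ \text{ for every } \varepsilon > 0
\;\Longrightarrow\; \mu(\{r_k\}) = 0,
\qquad
\mu(\mathbb{Q}) = \sum_{k=1}^{\infty} \mu(\{r_k\}) = 0 .
```

Hence $f = 0$ except on a set of measure zero, and the integral of $f$ is zero.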


In analogy with the customary notation for the Riemann integral, it will frequently
be convenient to write the integral (2.10) as $\int f(x) \, d\mu(x)$. This is especially
true when $f$ is given by an explicit formula.

The integral defined above has the properties one would expect of it. In particular,
for any real numbers $c_1, \ldots, c_n$ and any integrable functions $f_1, \ldots, f_n$, $\sum c_i f_i$ is
also integrable and

(2.11) $\int \left( \sum c_i f_i \right) d\mu = \sum c_i \int f_i \, d\mu.$

Also, if $f$ is measurable and $g$ integrable and if $0 \le f \le g$, then $f$ is also
integrable, and

(2.12) $\int f \, d\mu \le \int g \, d\mu.$

We shall often be dealing with statements that hold except on a set of measure
zero. If a statement holds for all $x$ in $\mathcal{X} - N$ where $\mu(N) = 0$, the statement is
said to hold a.e. (almost everywhere) $\mu$ (or a.e. if the measure $\mu$ is clear from the
context).

It is sometimes required to know when $f(x) = \lim f_n(x)$ or more generally
when

(2.13) $f(x) = \lim f_n(x)$ (a.e. $\mu$)

implies that

(2.14) $\int f \, d\mu = \lim \int f_n \, d\mu.$

Here is a sufficient condition.

Theorem 2.5 (Dominated Convergence) If the $f_n$ are measurable and satisfy
(2.13), and if there exists an integrable function $g$ such that

(2.15) $|f_n(x)| \le g(x)$ for all $x$,

then the $f_n$ and $f$ are integrable and (2.14) holds.

The following is another useful result concerning integrals of sequences of 
functions. 


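Lemma 2.6 (Fatou) If $\{f_n\}$ is a sequence of non-negative measurable functions,
then

(2.16) $\int \left( \liminf_{n \to \infty} f_n \right) d\mu \le \liminf_{n \to \infty} \int f_n \, d\mu$

with the reverse inequality holding for limsup provided the $f_n$ are bounded above
by an integrable function.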


Recall that the liminf and limsup of a sequence of numbers are, respectively, the 
smallest and largest limit points that can be obtained through subsequences. See 
Problems 2.5 and 2.6. 
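A standard worked example shows both that the inequality in Fatou's lemma can be strict and that the domination condition of the dominated convergence theorem cannot simply be dropped. The functions $f_n$ below are an illustrative choice, with $\mu$ taken to be Lebesgue measure on $(0, 1)$.

```latex
f_n(x) = n \, I(0 < x < 1/n), \qquad
\lim_{n \to \infty} f_n(x) = 0 \ \text{ for every } x \in (0, 1), \qquad
\int f_n \, d\mu = n \cdot \tfrac{1}{n} = 1,
\\[4pt]
\int \Bigl( \liminf_{n \to \infty} f_n \Bigr) d\mu = 0
\;<\; 1 = \liminf_{n \to \infty} \int f_n \, d\mu .
```

Here the limit of the integrals is not the integral of the limit. This is consistent with the theorem: any $g$ dominating all the $f_n$ must satisfy $g(x) \ge 1/x - 1$ on $(0, 1)$, and such a function is not integrable.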

As a last extension of the concept of integral, define

(2.17) $\int_A f \, d\mu = \int I_A f \, d\mu$


when the integral on the right exists. It follows in particular from (2.8) and (2.17)
that

(2.18) $\int_A d\mu = \mu(A).$

Obviously such properties as (2.11) and (2.12) continue to hold when $\int$ is replaced
by $\int_A$.

It is often useful to know under what conditions an integrable function $f$ satisfies

(2.19) $\int_A f \, d\mu = 0.$

This will clearly be the case when either

(2.20) $f = 0$ on $A$

or

(2.21) $\mu(A) = 0.$

More generally, it will be the case whenever

(2.22) $f = 0$ a.e. on $A$,

that is, $f$ is zero except on a subset of $A$ having measure zero.
Conversely, if $f$ is a.e. non-negative on $A$,

(2.23) $\int_A f \, d\mu = 0 \Rightarrow f = 0$ a.e. on $A$,

and if $f$ is a.e. positive on $A$, then

(2.24) $\int_A f \, d\mu = 0 \Rightarrow \mu(A) = 0.$

Note that, as a special case of (2.22), if $f$ and $g$ are integrable functions differing
only on a set of measure zero, that is, if $f = g$ (a.e. $\mu$), then

$\int f \, d\mu = \int g \, d\mu.$

It is a consequence that functions can never be determined by their integrals
uniquely but at most up to sets of measure zero.

For a non-negative integrable function $f$, let us now consider

(2.25) $\nu(A) = \int_A f \, d\mu$

as a set function defined over $\mathcal{A}$. Then $\nu$ is non-negative, $\sigma$-finite, and $\sigma$-additive
and hence a measure over $(\mathcal{X}, \mathcal{A})$.

If $\mu$ and $\nu$ are two measures defined over the same measurable space $(\mathcal{X}, \mathcal{A})$, it is
a question of central importance whether there exists a function $f$ such that (2.25)
holds for all $A \in \mathcal{A}$. By (2.21), a necessary condition for such a representation is
clearly that

(2.26) $\mu(A) = 0 \Rightarrow \nu(A) = 0.$

When (2.26) holds, $\nu$ is said to be absolutely continuous with respect to $\mu$. It is a
surprising and basic fact known as the Radon-Nikodym theorem that (2.26) is not
only necessary but also sufficient for the existence of a function $f$ satisfying (2.25)
for all $A \in \mathcal{A}$. The resulting function $f$ is called the Radon-Nikodym derivative
of $\nu$ with respect to $\mu$. This $f$ is not unique because it can be changed on a set of
$\mu$-measure zero without affecting the integrals (2.25). However, it is unique a.e. $\mu$
in the sense that if $g$ is any other integrable function satisfying (2.25), then $f = g$
(a.e. $\mu$). It is a useful consequence of this result that

$\int_A f \, d\mu = 0$ for all $A \in \mathcal{A}$

implies that $f = 0$ (a.e. $\mu$).
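On a finite space the Radon-Nikodym derivative can be written down directly, which may make the theorem more concrete. The sketch below is an illustration under assumed data: the four points and the point masses of `mu` and `nu` are hypothetical, and the derivative is taken to be the ratio of point masses wherever `mu` is positive (its value elsewhere is arbitrary, which is the non-uniqueness on null sets mentioned above).

```python
from fractions import Fraction
from itertools import chain, combinations

points = ["a", "b", "c", "d"]
# Hypothetical point masses: mu is the dominating measure, nu the dominated one.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(0)}
nu = {"a": Fraction(1, 3), "b": Fraction(2, 3), "c": Fraction(0), "d": Fraction(0)}

def measure(m, A):
    """Measure of the subset A under the point masses m."""
    return sum((m[x] for x in A), Fraction(0))

def all_subsets(xs):
    """Every subset of xs, as tuples."""
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Condition (2.26): every mu-null set is also nu-null.
absolutely_continuous = all(
    measure(nu, A) == 0 for A in all_subsets(points) if measure(mu, A) == 0
)

# One version of the derivative: nu({x}) / mu({x}) where mu({x}) > 0, and 0 elsewhere.
f = {x: (nu[x] / mu[x] if mu[x] > 0 else Fraction(0)) for x in points}

def integral_over(A):
    """Integral of f over A with respect to mu, the right side of (2.25)."""
    return sum((f[x] * mu[x] for x in A), Fraction(0))

# The representation (2.25) holds for every subset A.
representation_holds = all(
    integral_over(A) == measure(nu, A) for A in all_subsets(points)
)
```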

The last theorem on integration we require is a form of Fubini's theorem which
essentially states that in a repeated integral of a non-negative function, the order
of integration is immaterial. To make this statement precise, define the Cartesian
product $A \times B$ of any two sets $A$, $B$ as the set of all ordered pairs $(a, b)$ with
$a \in A$, $b \in B$. Let $(\mathcal{X}, \mathcal{A}, \mu)$ and $(\mathcal{Y}, \mathcal{B}, \nu)$ be two measure spaces, and define
$\mathcal{A} \times \mathcal{B}$ to be the smallest $\sigma$-field containing all sets $A \times B$ with $A \in \mathcal{A}$ and $B \in \mathcal{B}$.
Then there exists a unique measure $\lambda$ over $\mathcal{A} \times \mathcal{B}$ which to any product set $A \times B$
assigns the measure $\mu(A) \cdot \nu(B)$. The measure $\lambda$ is called the product measure of
$\mu$ and $\nu$ and is denoted by $\mu \times \nu$.

Example 2.7 Borel sets. If $\mathcal{X}$ and $\mathcal{Y}$ are Euclidean spaces $E_m$ and $E_n$ and $\mathcal{A}$ and
$\mathcal{B}$ the $\sigma$-fields of Borel sets of $\mathcal{X}$ and $\mathcal{Y}$ respectively, then $\mathcal{X} \times \mathcal{Y}$ is Euclidean
space $E_{m+n}$, and $\mathcal{A} \times \mathcal{B}$ is the class of Borel sets of $\mathcal{X} \times \mathcal{Y}$. If, in addition, $\mu$ and
$\nu$ are Lebesgue measure on $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$, then $\mu \times \nu$ is Lebesgue measure
on $(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \times \mathcal{B})$. ||


An integral with respect to a product measure generalizes the concept of a double 
integral. The following theorem, which is one version of Fubini’s theorem, states 
conditions under which a double integral is equal to a repeated integral and under 
which it is permitted to change the order of integration in a repeated integral. 

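Theorem 2.8 (Fubini) Let $(\mathcal{X}, \mathcal{A}, \mu)$ and $(\mathcal{Y}, \mathcal{B}, \nu)$ be measure spaces and let $f$
be a non-negative $\mathcal{A} \times \mathcal{B}$-measurable function defined on $\mathcal{X} \times \mathcal{Y}$.
Then

(2.27) $\int_{\mathcal{X}} \left[ \int_{\mathcal{Y}} f(x, y) \, d\nu(y) \right] d\mu(x) = \int_{\mathcal{Y}} \left[ \int_{\mathcal{X}} f(x, y) \, d\mu(x) \right] d\nu(y) = \int_{\mathcal{X} \times \mathcal{Y}} f \, d(\mu \times \nu).$

Here, the first term is the repeated integral in which $f$ is first integrated for fixed
$x$ with respect to $\nu$, and then the result with respect to $\mu$. The inner integrals of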
the first two terms in (2.27) are, of course, not defined unless f(x, y), for fixed 
values of either variable, is a measurable function of the other. Fortunately, under 
the assumptions of the theorem, this is always the case. Similarly, existence of 
the outer integrals requires the inner integrals to be measurable functions of the 
variable that has not been integrated. This condition is also satisfied. 
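For finite spaces, the three integrals in Fubini's theorem reduce to finite sums and can be compared exactly. The following sketch assumes hypothetical point masses `mu` and `nu`, a hypothetical weight table `w`, and a non-negative function `f` chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical finite measure spaces given by point masses.
mu = {0: Fraction(1, 2), 1: Fraction(3, 2)}                         # measure on X
nu = {"u": Fraction(1, 3), "v": Fraction(2, 3), "w": Fraction(1)}   # measure on Y
w = {"u": 1, "v": 4, "w": 2}                                        # weights used by f

def f(x, y):
    """A non-negative function on X x Y (illustrative)."""
    return (x + 1) * w[y] + x

# Repeated integral: integrate over y for fixed x, then over x.
x_outer = sum(sum(f(x, y) * nu[y] for y in nu) * mu[x] for x in mu)

# Repeated integral in the other order: over x for fixed y, then over y.
y_outer = sum(sum(f(x, y) * mu[x] for x in mu) * nu[y] for y in nu)

# Single integral with respect to the product measure,
# which assigns mass mu[x] * nu[y] to the point (x, y).
product = sum(f(x, y) * mu[x] * nu[y] for x in mu for y in nu)
```

All three quantities coincide, as the theorem states for non-negative functions.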


3 Probability Theory 

For work in statistics, the most important application of measure theory is its
specialization to probability theory. A measure $P$ defined over a measurable space
$(\mathcal{X}, \mathcal{A})$ satisfying

(3.1) $P(\mathcal{X}) = 1$

is a probability measure (or probability distribution), and the value $P(A)$ it assigns
to $A$ is the probability of $A$. If $P$ is absolutely continuous with respect to a measure
$\mu$ with Radon-Nikodym derivative $p$, so that

(3.2) $P(A) = \int_A p \, d\mu,$

$p$ is called the probability density of $P$ with respect to $\mu$. Such densities are, of
course, determined only up to sets of $\mu$-measure zero.

We shall be concerned only with situations in which $\mathcal{X}$ is Euclidean, and typically
the distributions will either be discrete (in which case $\mu$ can be taken to be
counting measure) or absolutely continuous with respect to Lebesgue measure.

Statistical problems are concerned not with single probability distributions but
with families of such distributions

(3.3) $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$

defined over a common measurable space $(\mathcal{X}, \mathcal{A})$. When all the distributions of $\mathcal{P}$
are absolutely continuous with respect to a common measure $\mu$, as will usually be
the case, the family $\mathcal{P}$ is said to be dominated (by $\mu$).

Most of the examples with which we shall deal belong to one or the other of the 
following two cases. 

(i) The discrete case. Here, $\mathcal{X}$ is a countable set, $\mathcal{A}$ is the class of subsets of $\mathcal{X}$,
and the distributions of $\mathcal{P}$ are dominated by counting measure.

(ii) The absolutely continuous case. Here, $\mathcal{X}$ is a Borel subset of a Euclidean
space, $\mathcal{A}$ is the class of Borel subsets of $\mathcal{X}$, and the distributions of $\mathcal{P}$ are
dominated by Lebesgue measure over $(\mathcal{X}, \mathcal{A})$.

It is one of the advantages of the general approach of this section that it includes 
both these cases, as well as mixed situations such as those arising with censored 
data (see Problem 3.8). 

When dealing with a family $\mathcal{P}$ of distributions, the most relevant null-set concept
is that of a $\mathcal{P}$-null set, that is, of a set $N$ satisfying

(3.4) $P(N) = 0$ for all $P \in \mathcal{P}$.

If a statement holds except on a set $N$ satisfying (3.4), we shall say that the statement
holds (a.e. $\mathcal{P}$). If $\mathcal{P}$ is dominated by $\mu$, then

(3.5) $\mu(N) = 0$

implies (3.4). When the converse also holds, $\mu$ and $\mathcal{P}$ are said to be equivalent.

To bring the customary probabilistic framework and terminology into consonance
with that of measure theory, it is necessary to define the concepts of random
variable and random vector. A random variable is the mathematical representation
of some real-valued aspect of an experiment with uncertain outcome. The experiment
may be represented by a space $\mathcal{E}$, and the full details of its possible outcomes
by the points $e$ of $\mathcal{E}$. The frequencies with which outcomes can be expected to fall
into different subsets $E$ of $\mathcal{E}$ (assumed to form a $\sigma$-field $\mathcal{B}$) are given by a probability
distribution over $(\mathcal{E}, \mathcal{B})$. A random variable is then a real-valued function $X$
defined over $\mathcal{E}$. Since we wish the probabilities of the events $X \le a$ to be defined,
the function $X$ must be measurable and the probability

(3.6) $F_X(a) = P(X \le a)$

is simply the probability of the set $\{e : X(e) \le a\}$. The function $F_X$ defined through
(3.6) is the cumulative distribution function (cdf) of $X$.

It is convenient to digress here briefly in order to define another concept of
absolute continuity. A real-valued function $f$ on $(-\infty, \infty)$ is said to be absolutely
continuous if given any $\varepsilon > 0$, there exists $\delta > 0$ such that for each finite collection
of disjoint bounded open intervals $(a_i, b_i)$,

(3.7) $\sum (b_i - a_i) < \delta$ implies $\sum |f(b_i) - f(a_i)| < \varepsilon.$

A connection with the earlier concept of absolute continuity of one measure with 
respect to another is established by the fact that a cdf F on the real line is absolutely 
continuous if and only if the probability measure it generates is absolutely continuous
with respect to Lebesgue measure. Any absolutely continuous function is
continuous (Problem 3.2), but the converse does not hold. In particular, there exist 
continuous cumulative distribution functions which are not absolutely continuous 
and therefore do not have a probability density with respect to Lebesgue measure. 
Such distributions are rather pathological and play little role in statistics. 

If not just one but $n$ real-valued aspects of an experiment are of interest, these
are represented by a measurable vector-valued function $(X_1, \ldots, X_n)$ defined over
$\mathcal{E}$, with the joint cdf

(3.8) $F_X(a_1, \ldots, a_n) = P[X_1 \le a_1, \ldots, X_n \le a_n]$

being the probability of the event

(3.9) $\{e : X_1(e) \le a_1, \ldots, X_n(e) \le a_n\}.$

The cdf (3.8) determines the probabilities of $(X_1, \ldots, X_n)$ falling into any Borel set
$A$, and these agree with the probabilities of the events

$\{e : [X_1(e), \ldots, X_n(e)] \in A\}.$

From this description of the mathematical model, one might expect the starting
point for modeling a specific situation to be the measurable space $(\mathcal{E}, \mathcal{B})$ and a family
$\mathcal{P}$ of probability distributions defined over it. However, the statistical analysis
of an experiment is typically not based on a full description of the experimental
outcome (which would, for example, include the smallest details concerning all
experimental subjects) represented by the points $e$ of $\mathcal{E}$. More often, the starting
point is a set of observations, represented by a random vector $X = (X_1, \ldots, X_n)$,
with all other aspects of the experiment being ignored. The specification of the
model will therefore begin with $X$, the data; the measurable space $(\mathcal{X}, \mathcal{A})$ in which
$X$ takes on its values, the sample space; and a family $\mathcal{P}$ of probability distributions
to which the distribution of $X$ is known to belong. Real-valued or vector-valued
measurable functions $T = (T_1, \ldots, T_r)$ of $X$ are called statistics; in particular,
estimators are statistics.

The change of starting point from $(\mathcal{E}, \mathcal{B})$ to $(\mathcal{X}, \mathcal{A})$ requires clarification of two
definitions: (1) In order to avoid reference to $(\mathcal{E}, \mathcal{B})$, it is convenient to require $T$ to
be a measurable function over $(\mathcal{X}, \mathcal{A})$ rather than over $(\mathcal{E}, \mathcal{B})$. Measurability with
respect to the original $(\mathcal{E}, \mathcal{B})$ is then an automatic consequence (Problem 3.3). (2)
Analogously, the expectation of a real-valued integrable $T$ is originally defined as

$\int T[X(e)] \, dP(e).$

However, it is legitimate to calculate it instead from the formula

$E(T) = \int T(x) \, dP_X(x)$

where $P_X$ denotes the probability distribution of $X$.
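The equality of the two ways of computing an expectation can be checked on a small discrete experiment. In the sketch below, the experiment (two fair dice), the random variable `X` (their sum), and the statistic `T` are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical experiment: two fair dice; each outcome e = (i, j) has probability 1/36.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {e: Fraction(1, 36) for e in outcomes}

def X(e):
    """The random variable: the sum of the two dice."""
    return e[0] + e[1]

def T(x):
    """A statistic, that is, a function of the observed value x."""
    return (x - 7) ** 2

# Expectation on the original space: the integral of T[X(e)] dP(e).
expectation_on_E = sum(T(X(e)) * P[e] for e in outcomes)

# Distribution P_X induced by X on its sample space.
P_X = defaultdict(Fraction)
for e in outcomes:
    P_X[X(e)] += P[e]

# Expectation from the induced distribution: the integral of T(x) dP_X(x).
expectation_on_X = sum(T(x) * p for x, p in P_X.items())
```

Both routes give the same number, so the space of outcomes `e` never needs to be referred to once `P_X` is known.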

As a last concept, we mention the support of a distribution P on (X, A). It is 
the set of all points x for which P(A) > 0 for all open rectangles A [defined by 
(2.3)] which contain x. 

Example 3.1 Support. Let X be a random variable with distribution P and cdf 
F, and suppose the support of P is a finite interval I with end points a and b. Then, 
I must be the closed interval [a, b] and F is strictly increasing on [a, b] (Problem 
3.4). || 

If P and Q are two probability measures on (X, A) and are equivalent (i.e., 
each is absolutely continuous with respect to the other), then they have the same 
support; however, the converse need not be true (Problems 3.6 and 3.7). 

Having outlined the mathematical foundation on which the statistical developments
of the later chapters are based, we shall from now on ignore it as far as
possible and instead concentrate on the statistical issues. In particular, we shall
pay little or no attention to two technical difficulties that occur throughout.

(i) The estimators that will be derived are statistics and hence need to be measurable.
However, we shall not check that this requirement is satisfied. In specific
examples, it is usually obvious. In more general constructions, it will be tacitly
understood that the conclusion holds only if the estimator in question is
measurable. In practice, the sets and functions in these constructions usually 
turn out to be measurable although verification of their measurability can be 
quite difficult. 

(ii) Typically, the estimators are also required to be integrable. This condition 
will not be as universally satisfied in our examples as measurability and will 
therefore be checked when it seems important to do so. In other cases, it will 
again be tacitly assumed. 

4 Group Families 

The two principal families of models with which we shall be concerned in this book 
are exponential families and group families. Between them, these families cover
many of the more common statistical models. In this and the next section, we shall
discuss these families and some of their properties, together with some of the more 
important special cases. More details about these and other special distributions 
can be found in the four-volume reference work on statistical distributions by 
Johnson and Kotz (1969-1972), and the later editions by Johnson, Kotz, and Kemp 
(1992) and Johnson, Kotz, and Balakrishnan (1994,1995). 

One of the main reasons for the central role played by these two families in 
statistics is that in each of them, it is possible to effect a great simplification of the 
data. In an exponential family, there exists a fixed (usually rather small) number of 
statistics to which the data can be reduced without loss of information, regardless 
of the sample size. In a group family, the simplification stems from the fact that the 
different distributions of the family play a highly symmetric role. This symmetry 
in the basic structure again leads essentially to a reduction of the dimension of the 
data since it is then natural to impose a corresponding symmetry requirement on 
the estimator. 

A group family of distributions is a family obtained by subjecting a random 
variable with a fixed distribution to a suitable family of transformations. 

Example 4.1 Location-scale families. Let $U$ be a random variable with a fixed
distribution $F$. If a constant $a$ is added to $U$, the resulting variable

(4.1) $X = U + a$

has distribution

(4.2) $P(X \le x) = F(x - a).$

The totality of distributions (4.2), for fixed $F$ and as $a$ varies from $-\infty$ to $\infty$, is
said to constitute a location family.

Analogously, a scale family is generated by the transformations

(4.3) $X = bU, \quad b > 0,$

and has the form

(4.4) $P(X \le x) = F(x/b).$

Combining these two types of transformations into

(4.5) $X = a + bU, \quad b > 0,$

one obtains the location-scale family

(4.6) $P(X \le x) = F\left(\frac{x - a}{b}\right).$

In applications of these families, $F$ usually has a density $f$ with respect to
Lebesgue measure. The density of (4.6) is then given by

(4.7) $\frac{1}{b} f\left(\frac{x - a}{b}\right).$

Table 4.1 exhibits several such densities, which will be used in the sequel. ||
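The density of a location-scale family, namely (1/b) f((x - a)/b) for X = a + bU when U has density f, is easy to check against a known case. The sketch below assumes the standard normal as the fixed distribution and compares the result with `statistics.NormalDist` from the Python standard library; the helper names and the values of `a` and `b` are hypothetical.

```python
import math
from statistics import NormalDist

def standard_normal_density(u):
    """Density f of the fixed distribution F, here taken to be N(0, 1)."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def location_scale_density(f, a, b):
    """Density of X = a + b*U when U has density f: x -> (1/b) f((x - a)/b)."""
    if b <= 0:
        raise ValueError("scale b must be positive")
    return lambda x: f((x - a) / b) / b

a, b = 1.5, 2.0                          # illustrative location and scale
density = location_scale_density(standard_normal_density, a, b)
reference = NormalDist(mu=a, sigma=b)    # the N(a, b^2) distribution
max_gap = max(abs(density(x) - reference.pdf(x)) for x in [-3.0, 0.0, 1.5, 4.0])
```

The two densities agree up to floating-point rounding, matching the normal row of Table 4.1.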

Table 4.1. Location-Scale Families ($-\infty < a < \infty$, $b > 0$)

Density                                                        Support                      Name                  Notation
$\frac{1}{\sqrt{2\pi}\, b} e^{-(x-a)^2/2b^2}$                  $-\infty < x < \infty$       Normal                $N(a, b^2)$
$\frac{1}{2b} e^{-|x-a|/b}$                                    $-\infty < x < \infty$       Double exponential    $DE(a, b)$
$\frac{b}{\pi} \frac{1}{b^2 + (x-a)^2}$                        $-\infty < x < \infty$       Cauchy                $C(a, b)$
$\frac{1}{b} \frac{e^{-(x-a)/b}}{[1 + e^{-(x-a)/b}]^2}$        $-\infty < x < \infty$       Logistic              $L(a, b)$
$\frac{1}{b} e^{-(x-a)/b} I_{[a,\infty)}(x)$                   $a < x < \infty$             Exponential           $E(a, b)$
$\frac{1}{b} I_{[a-b/2,\, a+b/2]}(x)$                          $a - b/2 < x < a + b/2$      Uniform               $U(a - b/2, a + b/2)$

In each of (4.1), (4.3), and (4.5), the class of transformations has the following
two properties.


(i) Closure under composition. Application of a 1:1 transformation $g_1$ from $\mathcal{X}$
to $\mathcal{X}$ followed by another, $g_2$, results in a new such transformation called the
composition of $g_1$ with $g_2$ and denoted by $g_2 \cdot g_1$. For the transformation (4.1),
addition first of $a_1$ and then of $a_2$ results in the addition of $a_1 + a_2$. For (4.3),
multiplication by $b_1$ and then by $b_2$ is equivalent to multiplication by $b_2 \cdot b_1$.
The composition rule for (4.5) is slightly more complicated. First transforming $u$
to $x = a_1 + b_1 u$ and then the result to $y = a_2 + b_2 x$ results in the transformation

(4.8) $y = a_2 + b_2(a_1 + b_1 u) = (a_2 + b_2 a_1) + b_2 b_1 u.$

A class $\mathcal{J}$ of transformations is said to be closed under composition if $g_1 \in
\mathcal{J}$, $g_2 \in \mathcal{J}$ implies that $g_2 \cdot g_1 \in \mathcal{J}$. We have just shown that the three classes
of transformations,

(4.9)   (4.1) with $-\infty < a < \infty$,
        (4.3) with $0 < b$,
        (4.5) with $-\infty < a < \infty$, $0 < b$,

are all closed with respect to composition. On the other hand, the class (4.1)
with $|a| < 1$ is not, since $U + 1/2$ and $U + 2/3$ are both members of the class
but their composition is not.

(ii) Closure under inversion. Given any 1:1 transformation $x' = gx$, let $g^{-1}$,
the inverse of $g$, denote the transformation which undoes what $g$ did, that is,
takes $x'$ back to $x$ so that $x = g^{-1}x'$. For the transformation which adds $a$, the
inverse subtracts $a$; the inverse in (4.3) of multiplication by $b$ is division by
$b$; and the inverse of $x = a + bu$ is $u = (x - a)/b$. A class $\mathcal{J}$ is said to be closed under
inversion if $g \in \mathcal{J}$ implies $g^{-1} \in \mathcal{J}$. The three classes listed in (4.9) are all
closed under inversion. On the other hand, (4.1) with $0 < a$ is not.







The structure of the class of transformations possessing these properties is a 
special case of a more general mathematical object, simply called a group. 

Definition 4.2 A set $G$ of elements is called a group if it satisfies the following
four conditions.

(i) There is defined an operation, group multiplication, which with any two elements
$a, b \in G$ associates an element $c$ of $G$. The element $c$ is called the
product of $a$ and $b$ and is denoted $ab$.

(ii) Group multiplication obeys the associative law

$(ab)c = a(bc).$

(iii) There exists an element $e \in G$, called the identity, such that

$ae = ea = a$ for all $a \in G$.

(iv) For each element $a \in G$, there exists an element $a^{-1}$, its inverse, such that

$aa^{-1} = a^{-1}a = e.$

Both the identity element and the inverse $a^{-1}$ of any element $a$ can be shown to
be unique.

The groups of primary interest in statistics are transformation groups. 

Definition 4.3 A class G of transformations is called a transformation group if it 
is closed under both composition and inversion. 

It is straightforward to verify (Problem 4.4) that a transformation group is, in
fact, a group. In particular, note that the identity transformation $x \equiv x$ is a member
of any transformation group $G$ since $g \in G$ implies $g^{-1} \in G$ and hence $g^{-1}g \in G$,
and by definition, $g^{-1}g$ is the identity. Note also that the inverse $(g^{-1})^{-1}$ of $g^{-1}$ is
$g$, so that $gg^{-1}$ is also the identity.

A transformation group $G$ which satisfies

$g_2 \cdot g_1 = g_1 \cdot g_2$

for all $g_1, g_2 \in G$ is called commutative. The first two groups of transformations
of (4.9) are commutative, but the third is not.
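The group properties of the location-scale transformations can be verified mechanically. The sketch below represents u -> a + b*u by the pair (a, b), with composition following (4.8); the class name `Affine` and the particular elements `g1`, `g2` are illustrative assumptions.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Affine:
    """The transformation u -> a + b*u with b > 0, as in (4.5)."""
    a: Fraction
    b: Fraction

    def __call__(self, u):
        return self.a + self.b * u

    def compose(self, first):
        """Return self . first, that is, apply `first` and then `self`; see (4.8)."""
        return Affine(self.a + self.b * first.a, self.b * first.b)

    def inverse(self):
        """The map x -> (x - a)/b, written again in the form a' + b'*x."""
        return Affine(-self.a / self.b, 1 / self.b)

identity = Affine(Fraction(0), Fraction(1))
g1 = Affine(Fraction(1), Fraction(2))       # u -> 1 + 2u
g2 = Affine(Fraction(3), Fraction(1, 2))    # u -> 3 + u/2

g2_after_g1 = g2.compose(g1)                # u -> 7/2 + u
g1_after_g2 = g1.compose(g2)                # u -> 7 + u
```

The two compositions differ, which is the non-commutativity of the third group in (4.9), while each element composed with its inverse gives the identity.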

Example 4.4 Continuation of Example 4.1. The group families (4.2), (4.4), and
(4.6) generalize easily to the case that $U$ is a vector $U = (U_1, \ldots, U_n)$, if one
defines

(4.10) $U + a = (U_1 + a, \ldots, U_n + a)$ and $bU = (bU_1, \ldots, bU_n)$.

This covers in particular the case that $X_1, \ldots, X_n$ are iid according to one of the
previous families, for example, one of the densities of Table 4.1. Larger group
families are obtained in the same way by letting


(4.11) $U + a = (U_1 + a_1, \ldots, U_n + a_n)$ and $bU = (b_1 U_1, \ldots, b_n U_n)$.

Example 4.5 Multivariate normal distribution. As a more special but very important
example, suppose next that $U = (U_1, \ldots, U_p)$ where the $U_i$ are independently
distributed as $N(0, 1)$ and let

(4.12) $\begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} = \begin{pmatrix} a_1 \\ \vdots \\ a_p \end{pmatrix} + B \begin{pmatrix} U_1 \\ \vdots \\ U_p \end{pmatrix}$

where $B$ is a nonsingular $p \times p$ matrix. The resulting family of distributions in
$p$-space is the family of nonsingular $p$-variate normal distributions. If the three
columns of (4.12) are denoted by $X$, $a$, and $U$, respectively,¹ (4.12) can be written
as

(4.13) $X = a + BU.$

From this equation, it is seen that the expectation and the covariance matrix $\Sigma$ of $X$ are given by

(4.14) $E(X) = a$ and $\Sigma = E[(X - a)(X - a)'] = BB'.$

To obtain the density of $X$, write the density of $U$ as

$\frac{1}{(\sqrt{2\pi})^p} e^{-(1/2) u'u}.$

Now $U = B^{-1}(X - a)$ and the Jacobian of the linear transformation (4.13) is just
the determinant $|B|$ of $B$ (taken in absolute value). Thus, by the usual formula for
transforming densities, the density of $X$ is seen to be

(4.15) $\frac{|B|^{-1}}{(\sqrt{2\pi})^p} e^{-(x-a)'\Sigma^{-1}(x-a)/2}.$

For the case $p = 2$, this reduces to (Problem 4.6)

(4.16) $\frac{1}{2\pi\sigma\tau\sqrt{1 - \rho^2}} \exp\left\{ -\left[ (x - \xi)^2/\sigma^2 - 2\rho(x - \xi)(y - \eta)/\sigma\tau + (y - \eta)^2/\tau^2 \right] / 2(1 - \rho^2) \right\}$

where we write $(x, y)$ for $(x_1, x_2)$ and $(\xi, \eta)$ for $(a_1, a_2)$, and where $\sigma^2 = \mathrm{var}(X)$,
$\tau^2 = \mathrm{var}(Y)$, and $\rho\sigma\tau = \mathrm{cov}(X, Y)$. ||


There is a difference between the transformation groups (4.1), (4.3), and (4.5),
on the one hand, and (4.13), on the other. In the first three cases, different transformations
of the group lead to different distributions. This is not true of (4.13) since
the distributions of

$a_1 + B_1 U$ and $a_2 + B_2 U$

coincide provided $a_1 = a_2$ and $B_1 B_1' = B_2 B_2'$. This occurs when $a_1 = a_2$ and
$(B_2^{-1} B_1)(B_2^{-1} B_1)'$ is the identity matrix, that is, when $B_2^{-1} B_1$ is orthogonal. The
same family of distributions can therefore be generated by restricting the matrices
$B$ in (4.13) to belong to a smaller group. In particular, it is enough to let $G$ be the
group of lower triangular matrices, in which all elements above the main diagonal
are zero (Problems 4.7 - 4.9).
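The reduction to lower triangular matrices can be illustrated in the case p = 2. The sketch below takes an arbitrary nonsingular matrix `B`, forms the covariance matrix BB', and then computes a lower triangular matrix `L` with LL' = BB' by the Cholesky recursion; since a normal distribution is determined by its mean and covariance matrix, a + LU and a + BU then have the same distribution. The matrix entries and helper names are assumptions made for the illustration.

```python
import math

def times_transpose(B):
    """Return B B' for a 2 x 2 matrix B given as nested lists."""
    (p, q), (r, s) = B
    return [[p * p + q * q, p * r + q * s],
            [p * r + q * s, r * r + s * s]]

def lower_triangular_factor(S):
    """Cholesky factor L (lower triangular, positive diagonal) with L L' = S,
    for a 2 x 2 positive definite matrix S."""
    l11 = math.sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = math.sqrt(S[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

B = [[1.0, 2.0], [3.0, 4.0]]          # an arbitrary nonsingular matrix (illustrative)
Sigma = times_transpose(B)            # covariance matrix of a + B U
L = lower_triangular_factor(Sigma)    # lower triangular matrix with the same L L'
Sigma_from_L = times_transpose(L)     # covariance matrix of a + L U
```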

¹ When it is not likely to cause confusion, we shall use $U$ and so on to denote both the vector and the
column with elements $U_i$.

Example 4.6 The linear model. Let us next consider a different generalization
of a location-scale family. As before, let $U = (U_1, \ldots, U_n)$ have a fixed joint
distribution and consider the transformations

(4.17) $X_i = a_i + bU_i, \quad i = 1, \ldots, n,$

where the translation vector $a = (a_1, \ldots, a_n)$ is restricted to be in some $s$-dimensional
linear subspace of $n$-space, that is, to satisfy a set of linear equations

(4.18) $a_i = \sum_{j=1}^{s} d_{ij} \beta_j \quad (i = 1, \ldots, n).$

Here, the $d_{ij}$ are fixed (without loss of generality the matrix $D = (d_{ij})$ is assumed
to be of rank $s$) and the $\beta_j$ are arbitrary.

The most important case of this model is that in which the $U$'s are iid as $N(0, 1)$.
The joint distribution of the $X$'s is then given by

(4.19) $\frac{1}{(\sqrt{2\pi}\, b)^n} \exp\left[ -\frac{1}{2b^2} \sum (x_i - a_i)^2 \right]$

with $a$ ranging over the $s$-dimensional subspace defined by (4.18). ||

We shall next consider a number of models in which the groups (and hence the 
resulting families of distributions) are much larger than in the situations discussed 
so far. 

Example 4.7 A nonparametric iid family. Let $U_1, \ldots, U_n$ be $n$ independent
random variables with a fixed continuous common distribution, say $N(0, 1)$, whose
support is the whole real line, and let $G$ be the class of all transformations

(4.20) $x_i = g(u_i)$

where $g$ is any continuous, strictly increasing function satisfying

(4.21) $\lim_{u \to -\infty} g(u) = -\infty, \quad \lim_{u \to \infty} g(u) = \infty.$

This class constitutes a group. The $X_i$ are again iid with common distribution,
say $F_g$. The class $\{F_g : g \in G\}$ is the class of all continuous distributions whose
support is $(-\infty, \infty)$, that is, the class of all distributions whose cdf is continuous
and strictly increasing on $(-\infty, \infty)$.

In this example, one may wish to impose on $g$ the additional restriction of
differentiability for all $u$. The resulting family of distributions will be as before
but restricted to have probability density with respect to Lebesgue measure. ||

Many variations of this basic example are of interest; we shall mention only a
few.

Example 4.8 Symmetric distributions. Consider the situation of Example 4.7
but with $g$ restricted to be odd, that is, to satisfy $g(-u) = -g(u)$ for all $u$. This
leads to the class of all distributions whose support is the whole real line and which
are symmetric with respect to the origin. If instead we let $X_i = g(U_i) + a$, $-\infty <
a < \infty$, the resulting class is that of all distributions whose support is the real line
and which are symmetric with the point $a$ of symmetry being specified. ||

Example 4.9 Continuation of Example 4.7. In Example 4.7, replace $N(0, 1)$ as
the initial distribution of the $U_i$ with the uniform distribution on $(0, 1)$, and let $G$
be the class of all strictly increasing continuous functions $g$ on $(0, 1)$ satisfying
$g(0) = 0$, $g(1) = 1$. If, then, $X_i = a + bg(U_i)$ with $-\infty < a < \infty$, $0 < b$, the
resulting group family is that of all continuous distributions whose support is an
interval. ||

The examples of group families considered so far are of two types. In Examples
4.1 - 4.6, the distributions within a family were naturally indexed by a relatively
small number of parameters ($a$ and $b$ in Example 4.1; the elements of the matrix
$B$ and the vector $a$ in Example 4.5; the quantities $b$ and $\beta_1, \ldots, \beta_s$ in Example
4.6). On the other hand, in Examples 4.7 - 4.9, the distribution of the $X_i$ was
fairly unrestricted, subject only to conditions such as independence, identity of
distribution, nature of support, continuity, and symmetry. The next example is the
prototype of a third kind of model arising in survey sampling.

Example 4.10 Sampling from a finite population. To motivate this model, consider
a finite population of $N$ elements (or subjects) to each of which is attached
a real number (for example, the age or income of the subject) and an identifying
label. A random sample of $n$ elements drawn from this population constitutes the
observations. Let the observed values and labels be $(X_1, J_1), \ldots, (X_n, J_n)$. The
following group family provides a possible model for this situation.

Let $v_1, \ldots, v_N$ be any fixed $N$ real numbers, and let the pairs $(U_1, J_1), \ldots,
(U_n, J_n)$ be $n$ of the pairs $(v_1, 1), \ldots, (v_N, N)$ selected at random, that is, in such
a way that all $\binom{N}{n}$ possible choices of $n$ pairs are equally likely.

Finally, let $G$ be the group of transformations

(4.22) $X_1 = U_1 + a_{J_1}, \ldots, X_n = U_n + a_{J_n}$

where the $N$-tuple $(a_1, \ldots, a_N)$ ranges over all possible $N$-tuples $-\infty < a_1,
a_2, \ldots, a_N < \infty$. If we put $y_i = v_i + a_i$, then the pairs $(X_1, J_1), \ldots, (X_n, J_n)$
are a random sample from the population $(y_1, 1), \ldots, (y_N, N)$, the $y$ values being
arbitrary.

This example can be extended in a number of ways. In particular, the sampling 
method, reflecting some knowledge concerning the population of y values, may be 
more complex. In stratified sampling , for instance, the population of N is divided 
into, say, s subpopulations of A), .... /V s members (EM, = N) and a sample of 
«, is drawn at random from the i th subpopulation (Problem 4.12). This and some 
other sampling schemes will be considered in Section 3.7. A different modification 
places some restrictions on the y’s such as 0 < y, < oo, or 0 < y,- < 1 (Problem 
4.11). || 
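
The construction in Example 4.10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population values `v`, the shift vector `a`, and the helper names `draw_labeled_sample` and `shift_by_label` are hypothetical and do not come from the text.

```python
import random

def draw_labeled_sample(values, n, rng):
    """Draw n of the pairs (v_j, j); every set of n labels is equally likely."""
    labels = rng.sample(range(1, len(values) + 1), n)
    return [(values[j - 1], j) for j in labels]

def shift_by_label(sample, a):
    """Apply (4.22): each value U is moved by the shift a_J belonging to its label J."""
    return [(u + a[j - 1], j) for (u, j) in sample]

# Hypothetical population of N = 5 fixed numbers v and an arbitrary shift vector a.
v = [0.0, 0.0, 0.0, 0.0, 0.0]
a = [31.0, 45.5, 27.0, 60.0, 38.5]
y = [vj + aj for vj, aj in zip(v, a)]   # y_j = v_j + a_j

rng = random.Random(1)
observed = shift_by_label(draw_labeled_sample(v, 3, rng), a)
```

Each observed pair is then one of the pairs $(y_j, j)$, so the transformed sample behaves like a random sample from the population of $y$ values.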

It was stated at the beginning of the section that in a group family, the different
members of the family play a highly symmetric role. However, the general
construction of such a family $\mathcal{P}$ as the distributions of $gU$, where $U$ has a fixed
distribution $P_0$ and $g$ ranges over a group $G$ of transformations, appears to single
out the distribution $P_0$ of $U$ (which is a member of $\mathcal{P}$ since the identity
transformation is a member of $G$) as the starting point of the construction. This asymmetry
is only apparent. Let $P_1$ be any distribution of $\mathcal{P}$ other than $P_0$ and consider the
family $\mathcal{P}'$ of distributions of $gV$ as $g$ ranges over $G$, where $V$ has distribution $P_1$.
Since $P_1$ is an element of $\mathcal{P}$, there exists an element $g_0$ of $G$ for which $g_0 U$ is
distributed according to $P_1$. Thus, $g_0 U$ can play the role of $V$, and $\mathcal{P}'$ is the family
of distributions of $g g_0 U$ as $g$ ranges over $G$. However, as $g$ ranges over $G$, so does
$g g_0$ (Problem 4.5), so that the family of distributions of $g g_0 U$, $g \in G$, is the same
as the family $\mathcal{P}$ of $gU$, $g \in G$. A group family is thus independent of which of
its members is taken as starting distribution.

If one cannot find a group generating a given family $\mathcal{P}$ of distributions, the
question arises whether such a group exists, that is, whether $\mathcal{P}$ is a group family.
In principle, the answer is easy. For the sake of simplicity, suppose that $\mathcal{P}$ is a
family of univariate distributions with continuous and strictly increasing cumulative
distribution functions. Let $F_0$ and $F$ be two such cdf's and suppose that $U$ is
distributed according to $F_0$. Then, if $g$ is strictly increasing, $g(U)$ is distributed
according to $F$ if and only if $g = F^{-1}(F_0)$ (Problem 4.14). Thus, the transformations
generating the family must be the transformations

(4.23)    $\{F^{-1}(F_0),\ F \in \mathcal{P}\}$.

The family $\mathcal{P}$ will be a group family if and only if the transformations (4.23) form
a group, that is, are closed under composition and inversion. In specific situations,
the calculations needed to check this requirement may not be easy. For an important
class of problems, the question has been settled by Borges and Pfanzagl (1965).
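
As a rough numerical illustration of the transformation $g = F^{-1}(F_0)$, the following Python sketch maps standard normal draws to exponential draws. The choice of $F_0$ as the $N(0, 1)$ cdf and $F$ as the Exponential(1) cdf is an assumption made here for illustration, as is the helper name `make_transform`.

```python
import math
import random
from statistics import NormalDist

def make_transform(F0, F_inv):
    """Return g = F^{-1}(F0(.)), which carries a draw from F0 to a draw from F."""
    return lambda u: F_inv(F0(u))

# Illustrative choices (assumptions): F0 is the N(0, 1) cdf, F the Exponential(1) cdf.
F0 = NormalDist(0.0, 1.0).cdf

def F_inv(p):
    """Inverse of F(x) = 1 - exp(-x), x > 0."""
    return -math.log(1.0 - p)

g = make_transform(F0, F_inv)

rng = random.Random(0)
sample = [g(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)                      # Exponential(1) has mean 1
frac_below_1 = sum(x <= 1.0 for x in sample) / len(sample)   # compare with F(1) = 1 - 1/e
```

The simulated mean and the fraction of draws below 1 should be close to their Exponential(1) values, up to Monte Carlo error.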


5 Exponential Families 


A family $\{P_\theta\}$ of distributions is said to form an $s$-dimensional exponential family
if the distributions $P_\theta$ have densities of the form

(5.1)    $p_\theta(x) = \exp\Big[\sum_{i=1}^{s} \eta_i(\theta) T_i(x) - B(\theta)\Big] h(x)$

with respect to some common measure $\mu$. Here, the $\eta_i$ and $B$ are real-valued
functions of the parameters and the $T_i$ are real-valued statistics, and $x$ is a point in
the sample space $\mathcal{X}$, the support of the density. Frequently, it is more convenient
to use the $\eta_i$ as the parameters and write the density in the canonical form

(5.2)    $p(x|\eta) = \exp\Big[\sum_{i=1}^{s} \eta_i T_i(x) - A(\eta)\Big] h(x).$

It should be noted that the form (5.2) is not unique. We can, for example, multiply
$\eta_i$ by a constant $c$ if, at the same time, $T_i$ is replaced by $T_i/c$. More generally, we
can make linear transformations of the $\eta$'s and $T$'s.

Both (5.1) and (5.2) are redundant in that the factor $h(x)$ could be absorbed into
$\mu$. The reason for not doing so is that it is then usually possible to take $\mu$ to be
either Lebesgue measure or counting measure rather than having to define a more
elaborate measure.

The function $p$ given by (5.2) is non-negative and is therefore a probability
density with respect to the given $\mu$, provided its integral with respect to $\mu$ equals
1. A constant $A(\eta)$ for which this is the case exists if and only if

(5.3)    $\int e^{\sum \eta_i T_i(x)} h(x)\, d\mu(x) < \infty.$

The set $\Xi$ of points $\eta = (\eta_1, \ldots, \eta_s)$ for which (5.3) holds is called the natural
parameter space of the family (5.2) and $\eta$ is called the natural parameter. It
is not difficult to see that $\Xi$ is convex (TSH2, Section 2.7, Lemma 7). In most
applications, it turns out to be open, but this need not be the case (Problem 5.1).
In the parametrization (5.1), the natural parameter space is the set of $\theta$ values for
which $[\eta_1(\theta), \ldots, \eta_s(\theta)]$ is in $\Xi$.


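Example 5.1 Normal family. If $X$ has the $N(\xi, \sigma^2)$ distribution, then $\theta = (\xi, \sigma^2)$
and the density with respect to Lebesgue measure is

$p_\theta(x) = \exp\Big[\frac{\xi}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\xi^2}{2\sigma^2}\Big] \frac{1}{\sqrt{2\pi}\,\sigma},$

a two-parameter exponential family with natural parameters $(\eta_1, \eta_2) = (\xi/\sigma^2,$
$-1/2\sigma^2)$ and natural parameter space $\mathbb{R} \times (-\infty, 0)$. ||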
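
The canonical form of the normal family can be checked numerically. The sketch below is one possible arrangement and rests on an assumption not made in the text: it takes $h(x) = 1$ and absorbs the factor $1/(\sqrt{2\pi}\,\sigma)$ into $A(\eta)$, which gives $A(\eta) = -\eta_1^2/(4\eta_2) + \frac{1}{2}\log(-\pi/\eta_2)$. The function names are hypothetical.

```python
import math

def normal_pdf(x, xi, sigma):
    """Usual N(xi, sigma^2) density."""
    return math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def natural_params(xi, sigma):
    """(eta_1, eta_2) = (xi / sigma^2, -1 / (2 sigma^2))."""
    return xi / sigma ** 2, -1.0 / (2 * sigma ** 2)

def log_partition(eta1, eta2):
    """A(eta) with h(x) = 1; finite only on the natural parameter space eta2 < 0."""
    if eta2 >= 0:
        raise ValueError("eta2 must be negative")
    return -eta1 ** 2 / (4 * eta2) + 0.5 * math.log(-math.pi / eta2)

def canonical_pdf(x, eta1, eta2):
    """exp[eta1 * T1(x) + eta2 * T2(x) - A(eta)] with T1(x) = x, T2(x) = x^2."""
    return math.exp(eta1 * x + eta2 * x * x - log_partition(eta1, eta2))

eta1, eta2 = natural_params(1.5, 0.8)
```

For any `x`, `canonical_pdf(x, eta1, eta2)` should agree with `normal_pdf(x, 1.5, 0.8)` up to rounding error.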


Some other examples of one- and two-parameter exponential families are shown 
in Table 5.1. 

If the statistics $T_1, \ldots, T_s$ satisfy linear constraints, the number $s$ of terms in
the exponent of (5.1) can be reduced. Unless this is done, the parameters $\eta_i$ are
statistically meaningless; they are unidentifiable (see Problem 5.2).

Definition 5.2 If $X$ is distributed according to $p_\theta$, then $\theta$ is said to be unidentifiable
on the basis of $X$ if there exist $\theta_1 \neq \theta_2$ for which $P_{\theta_1} = P_{\theta_2}$.

A reduction is also possible when the $\eta$'s satisfy a linear constraint. In the latter
case, the natural parameter space will be a convex set which lies in a linear subspace
of dimension less than $s$. If the representation (5.2) is minimal in the sense that
neither the $T$'s nor the $\eta$'s satisfy a linear constraint, the natural parameter space
will then be a convex set in $E_s$ containing an open $s$-dimensional rectangle. If
(5.2) is minimal and the parameter space contains an $s$-dimensional rectangle, the
family (5.2) is said to be of full rank.

Example 5.3 Multinomial. In $n$ independent trials with $s + 1$ possible outcomes,
let the probability of the $i$th outcome be $p_i$ in each trial. If $X_i$ denotes the number
of trials resulting in outcome $i$ ($i = 0, 1, \ldots, s$), then the joint distribution of the
$X$'s is the multinomial distribution $M(p_0, \ldots, p_s; n)$

(5.4)    $P(X_0 = x_0, \ldots, X_s = x_s) = \frac{n!}{x_0! \cdots x_s!}\, p_0^{x_0} \cdots p_s^{x_s},$

which can be rewritten as

$\exp(x_0 \log p_0 + \cdots + x_s \log p_s)\, h(x).$

Since the $x_i$ add up to $n$, this can be reduced to

(5.5)    $\exp[n \log p_0 + x_1 \log(p_1/p_0) + \cdots + x_s \log(p_s/p_0)]\, h(x).$

This is an $s$-dimensional exponential family with

(5.6)    $\eta_i = \log(p_i/p_0), \quad A(\eta) = -n \log p_0 = n \log\Big[1 + \sum_{i=1}^{s} e^{\eta_i}\Big].$

The natural parameter space is the set of all $(\eta_1, \ldots, \eta_s)$ with $-\infty < \eta_i < \infty$. ||

Table 5.1. Some One- and Two-Parameter Exponential Families

Density*                                                       Name                        Notation         Support
$\frac{1}{\Gamma(a) b^a}\, x^{a-1} e^{-x/b}$                     Gamma($a$, $b$)             $\Gamma(a, b)$   $0 < x < \infty$
$\frac{1}{\Gamma(f/2) 2^{f/2}}\, x^{f/2-1} e^{-x/2}$             Chi-squared($f$)            $\chi^2_f$       $0 < x < \infty$
$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1-x)^{b-1}$    Beta($a$, $b$)              $B(a, b)$        $0 < x < 1$
$p^x (1-p)^{1-x}$                                              Bernoulli($p$)              $b(p)$           $x = 0, 1$
$\binom{n}{x} p^x (1-p)^{n-x}$                                 Binomial($p$, $n$)          $b(p, n)$        $x = 0, 1, \ldots, n$
$\frac{1}{x!}\, \lambda^x e^{-\lambda}$                          Poisson($\lambda$)          $P(\lambda)$     $x = 0, 1, \ldots$
$\binom{m+x-1}{m-1} p^m (1-p)^x$                               Negative binomial($p$, $m$) $Nb(p, m)$       $x = 0, 1, \ldots$

*The density of the first three distributions is with respect to Lebesgue measure,
and that of the last four with respect to counting measure.


In the normal family of Example 5.1, it might be the case that the mean and 
the variance are related. [Such a model can be useful in data analysis, where the 
variance may be modeled as a power of the mean (see, for example, Snedecor and 
Cochran 1989, Section 15.10).] In such cases, when the natural parameters of the 
distribution are related in a nonlinear way, we say that (5.1) or (5.2) forms a curved 
exponential family (see Note 10.6). 


Example 5.4 Curved normal family. For the normal family of Example 5.1,
assume that $\xi = \sigma$, so that

(5.7)    $p_\theta(x) = \frac{1}{\sqrt{2\pi}\,\xi} \exp\Big[-\frac{1}{2\xi^2} x^2 + \frac{1}{\xi} x - \frac{1}{2}\Big], \quad \xi > 0.$

Although this is formally a two-parameter exponential family with natural parameter
$(\frac{1}{\xi}, -\frac{1}{2\xi^2})$, this parameter is, in fact, generated by the single parameter $\xi$. The
two-dimensional parameter $(\frac{1}{\xi}, -\frac{1}{2\xi^2})$ lies on a curve in $\mathbb{R}^2$, making (5.7) a curved
exponential family. ||

The underlying parameter in a curved exponential family need not be one dimensional.
The following is an example in which it is two dimensional.

Example 5.5 Logit model. Let $X_i$ be independent $b(p_i, n_i)$, $i = 1, \ldots, m$, so that
their joint distribution is

$P(X_1 = x_1, \ldots, X_m = x_m) = \prod_{i=1}^{m} \binom{n_i}{x_i} p_i^{x_i} (1 - p_i)^{n_i - x_i}.$

This can be written as

(5.8)    $\exp\Big\{\sum_{i=1}^{m} x_i \log\frac{p_i}{1 - p_i}\Big\} \prod_{i=1}^{m} (1 - p_i)^{n_i} \binom{n_i}{x_i},$

an $m$-dimensional exponential family with natural parameters $\eta_i = \log[p_i/(1 - p_i)]$,
$i = 1, \ldots, m$. The quantity $\log[p/(1 - p)]$ is known as the logit of $p$.

If the $\eta_i$'s satisfy

(5.9)    $\eta_i = \alpha + \beta z_i, \quad i = 1, \ldots, m,$

for known covariates $z_i$, the model only contains the two parameters $\alpha$ and $\beta$ and
(5.8) becomes a curved exponential family (see Note 10.6). ||

Note that the parameter space of an $s$-dimensional curved exponential family
cannot contain an $s$-dimensional rectangle, so a curved exponential family is not
of full rank. Nevertheless, as long as the $T$'s are not rank deficient, a curved
exponential family shares many of the following properties of a full rank family.
(An exception is completeness of the sufficient statistic, discussed in the next
section.) A more detailed treatment can be found in Brown (1986a) or Barndorff-
Nielsen and Cox (1994).

Let $X$ and $Y$ be independently distributed according to $s$-dimensional exponential
families (not necessarily full rank) with densities

(5.10)    $\exp\big[\sum \eta_i T_i(x) - A(\eta)\big] h(x)$ and $\exp\big[\sum \eta_i U_i(y) - C(\eta)\big] k(y)$

with respect to measures $\mu$ and $\nu$ over $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$, respectively. Then,
the joint distribution of $X$, $Y$ is again an exponential family, and by induction, the
result extends to the joint distribution of more than two factors. The most important
special case is that of iid random variables $X_i$, each distributed according to (5.1):
The exponential structure is preserved under random sampling. The joint density
of $X = (X_1, \ldots, X_n)$ is

(5.11)    $\exp\big[\sum \eta_i(\theta) T_i'(x) - nB(\theta)\big]\, h(x_1) \cdots h(x_n)$

with $T_i'(x) = \sum_{j=1}^{n} T_i(x_j)$.

Example 5.6 Normal sample. Let $X_i$ ($i = 1, \ldots, n$) be iid according to $N(\xi, \sigma^2)$.
Then, the joint density of $X_1, \ldots, X_n$ with respect to Lebesgue measure in $E_n$ is

(5.12)    $\exp\Big(\frac{\xi}{\sigma^2} \sum x_i - \frac{1}{2\sigma^2} \sum x_i^2 - \frac{n}{2\sigma^2} \xi^2\Big) \cdot \frac{1}{(\sqrt{2\pi}\,\sigma)^n}.$

As in the case $n = 1$ (Example 5.1), this constitutes a two-parameter exponential
family with natural parameters $(\xi/\sigma^2, -1/2\sigma^2)$. ||

Example 5.7 Bivariate normal. Suppose that $(X_i, Y_i)$, $i = 1, \ldots, n$, is a sample
from the bivariate normal density (4.16). Then, it is seen that the joint density of
the $n$ pairs is a five-parameter exponential density with statistics

$T_1 = \sum X_i, \quad T_2 = \sum X_i^2, \quad T_3 = \sum X_i Y_i, \quad T_4 = \sum Y_i, \quad T_5 = \sum Y_i^2.$

This example easily generalizes to the $p$-variate case (Problem 5.3). ||


A useful property of exponential families is given by the following theorem, 
which is proved, for example, in TSH2 (Chapter 2, Theorem 9) and in Barndorff- 
Nielsen (1978, Section 7.1). 

Theorem 5.8 For any integrable function $f$ and any $\eta$ in the interior of $\Xi$, the
integral

(5.13)    $\int f(x) \exp\big[\sum \eta_i T_i(x)\big] h(x)\, d\mu(x)$

is continuous and has derivatives of all orders with respect to the $\eta$'s, and these
can be obtained by differentiating under the integral sign.

As an application, differentiate the identity

$\int \exp\big[\sum \eta_i T_i(x) - A(\eta)\big] h(x)\, d\mu(x) = 1$

with respect to $\eta_j$ to find

(5.14)    $E_\eta(T_j) = \frac{\partial}{\partial \eta_j} A(\eta).$

Differentiating (5.14), in turn, with respect to $\eta_k$ leads to

(5.15)    $\mathrm{cov}(T_j, T_k) = \frac{\partial^2}{\partial \eta_j\, \partial \eta_k} A(\eta).$

(For the corresponding formulas in terms of (5.1), see Problem 5.6.)

Example 5.9 Continuation of Example 5.3. From (5.6), (5.14), and (5.15), one
easily finds for the multinomial variables of Example 5.3 that (Problem 5.15)

(5.16)    $E(X_i) = n p_i, \quad \mathrm{cov}(X_j, X_k) = \begin{cases} n p_j (1 - p_j) & \text{if } k = j \\ -n p_j p_k & \text{if } k \neq j. \end{cases}$ ||
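
The route from (5.6) through (5.14) and (5.15) to (5.16) can be checked numerically by finite differences of $A(\eta)$. The following Python sketch is only an illustration; the probabilities, the value of $n$, the step sizes, and the helper names `grad` and `hessian` are assumptions chosen here.

```python
import math

def A(eta, n):
    """A(eta) = n log(1 + sum_i exp(eta_i)), as in (5.6)."""
    return n * math.log(1.0 + sum(math.exp(e) for e in eta))

def grad(f, eta, h=1e-5):
    """Central-difference approximation to the gradient of f at eta."""
    out = []
    for j in range(len(eta)):
        up = list(eta); up[j] += h
        dn = list(eta); dn[j] -= h
        out.append((f(up) - f(dn)) / (2 * h))
    return out

def hessian(f, eta, h=1e-4):
    """Central-difference approximation to the matrix of second derivatives."""
    s = len(eta)
    H = [[0.0] * s for _ in range(s)]
    for j in range(s):
        for k in range(s):
            def shifted(dj, dk):
                e = list(eta)
                e[j] += dj
                e[k] += dk
                return f(e)
            H[j][k] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
    return H

# Hypothetical multinomial with s = 2, n = 10 and (p0, p1, p2) = (0.2, 0.3, 0.5).
n, p0, p = 10, 0.2, [0.3, 0.5]
eta = [math.log(pi / p0) for pi in p]
means = grad(lambda e: A(e, n), eta)      # compare with n * p_i
covs = hessian(lambda e: A(e, n), eta)    # compare with (5.16)
```

Up to discretization error, `means` should reproduce $n p_i$ and `covs` the variances and covariances in (5.16).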


As will be discussed in the next section, in an exponential family the statistics
$T = (T_1, \ldots, T_s)$ carry all the information about $\eta$ or $\theta$ contained in the data, so
that all statistical inferences concerning these parameters will be based on the $T$'s.
For this reason, we shall frequently be interested in calculating not only the first
two moments of the $T$'s given by (5.14) and (5.15) but also some of the higher
moments

(5.17)    $\alpha_{r_1, \ldots, r_s} = E(T_1^{r_1} \cdots T_s^{r_s})$

and central moments

(5.18)    $\mu_{r_1, \ldots, r_s} = E\{[T_1 - E(T_1)]^{r_1} \cdots [T_s - E(T_s)]^{r_s}\}.$

A tool that often facilitates such calculations is the moment generating function

(5.19)    $M_T(u_1, \ldots, u_s) = E(e^{u_1 T_1 + \cdots + u_s T_s}).$

If $M_T$ exists in some neighborhood $\sum u_i^2 < \delta$ of the origin, then all moments
$\alpha_{r_1, \ldots, r_s}$ exist and are the coefficients in the expansion of $M_T$ as a power series

(5.20)    $M_T(u_1, \ldots, u_s) = \sum_{(r_1, \ldots, r_s)} \alpha_{r_1, \ldots, r_s}\, u_1^{r_1} \cdots u_s^{r_s} / r_1! \cdots r_s!$

As an alternative, it is sometimes more convenient to calculate, instead, the
cumulants $\kappa_{r_1, \ldots, r_s}$, defined as the coefficients in the expansion of the cumulant
generating function

(5.21)    $K_T(u_1, \ldots, u_s) = \log M_T(u_1, \ldots, u_s) = \sum_{(r_1, \ldots, r_s)} \kappa_{r_1, \ldots, r_s}\, u_1^{r_1} \cdots u_s^{r_s} / r_1! \cdots r_s!$

From the cumulants, the moments can be determined by formal comparison of
the two power series (see, for example, Cramer 1946a, p. 186, or Stuart and Ord
1987, Chapter 3). For $s = 1$, one finds, for example (Problem 5.7),

(5.22)    $\alpha_1 = \kappa_1, \quad \alpha_2 = \kappa_2 + \kappa_1^2, \quad \alpha_3 = \kappa_3 + 3\kappa_1 \kappa_2 + \kappa_1^3,$
          $\alpha_4 = \kappa_4 + 3\kappa_2^2 + 4\kappa_1 \kappa_3 + 6\kappa_1^2 \kappa_2 + \kappa_1^4.$

For exponential families, the moment and cumulant generating functions can
be expressed rather simply as follows.

Theorem 5.10 If $X$ is distributed with density (5.2), then for any $\eta$ in the interior
of $\Xi$, the moment and cumulant generating functions $M_T(u)$ and $K_T(u)$ of the $T$'s
exist in some neighborhood of the origin and are given by

(5.23)    $K_T(u) = A(\eta + u) - A(\eta)$

and

(5.24)    $M_T(u) = e^{A(\eta + u)} / e^{A(\eta)},$

respectively.

Frequently, the calculation of moments becomes particularly easy when they can
be represented as the sum of independent terms. We shall illustrate two examples
for the case $s = 1$.

(a) Suppose $X = X_1 + \cdots + X_n$, where the $X_i$ are independent with moment and
cumulant generating functions $M_{X_i}(u)$ and $K_{X_i}(u)$, respectively. Then

$M_X(u) = E[e^{u(X_1 + \cdots + X_n)}] = M_{X_1}(u) \cdots M_{X_n}(u)$

and therefore

$K_X(u) = \sum_{i=1}^{n} K_{X_i}(u).$

From the definition of cumulants, it then follows that

(5.25)    $\kappa_r = \sum_{i=1}^{n} \kappa_{ir}$

where $\kappa_{ir}$ is the $r$th cumulant of $X_i$.

(b) The situation is also very simple for low central moments. If $\xi_i = E(X_i)$, $\sigma_i^2 =$
$\mathrm{var}(X_i)$ and the $X_i$ are independent, one easily finds (Problem 5.7)

(5.26)    $\mathrm{var}(\sum X_i) = \sum \sigma_i^2, \quad E[\sum (X_i - \xi_i)]^3 = \sum E(X_i - \xi_i)^3,$
          $E[\sum (X_i - \xi_i)]^4 = \sum E(X_i - \xi_i)^4 + 6 \sum_{i<j} \sigma_i^2 \sigma_j^2.$

For the case of identical components with $\xi_i = \xi$, $\sigma_i^2 = \sigma^2$, this reduces to

(5.27)    $\mathrm{var}(\sum X_i) = n\sigma^2, \quad E[\sum (X_i - \xi)]^3 = nE(X_1 - \xi)^3,$
          $E[\sum (X_i - \xi)]^4 = nE(X_1 - \xi)^4 + 3n(n - 1)\sigma^4.$


The following are a few of the many important special cases of exponential families
and some of their moments. Additional examples are given in the problems;
see also Johanson (1979), Brown (1986a), or Hoffmann-Jorgensen (1994, Chapter
12).

Example 5.11 Binomial moments. Let $X$ have the binomial distribution $b(p, n)$
so that for $x = 0, 1, \ldots, n$,

(5.28)    $P(X = x) = \binom{n}{x} p^x q^{n-x} \quad (0 < p < 1;\ q = 1 - p).$

This is the special case of the multinomial distribution (5.4) with $s = 1$. The
probability (5.28) can be rewritten as

$\binom{n}{x} e^{x \log(p/q) + n \log q},$

which defines an exponential family, with $\mu$ being counting measure over the
points $x = 0, 1, \ldots, n$ and with

(5.29)    $\eta = \log(p/q), \quad A(\eta) = n \log(1 + e^{\eta}).$

From (5.24) and (5.29), one finds that (Problem 5.8)

(5.30)    $M_X(u) = (q + p e^u)^n.$

An easy way to obtain the expectation and the first three central moments of $X$
is to use the fact that $X$ arises as the number of successes in $n$ Bernoulli trials with
success probability $p$, and hence that $X = \sum X_i$, where $X_i$ is 1 or 0, as the $i$th trial
is or is not a success. From (5.27) and the moments of $X_i$, one then finds (Problem
5.8)

(5.31)    $E(X) = np, \qquad\quad E(X - np)^3 = npq(q - p),$
          $\mathrm{var}(X) = npq, \quad E(X - np)^4 = 3(npq)^2 + npq(1 - 6pq).$ ||
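
Formulas (5.29) to (5.31) can be verified by direct summation over the binomial probabilities. The Python sketch below is illustrative only; the values $n = 12$, $p = 0.3$ and the function names are assumptions chosen here.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) from (5.28)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def A(eta, n):
    """A(eta) = n log(1 + e^eta), from (5.29)."""
    return n * math.log(1.0 + math.exp(eta))

def mgf_from_A(u, n, p):
    """M_X(u) = exp[A(eta + u) - A(eta)], following (5.24)."""
    eta = math.log(p / (1 - p))
    return math.exp(A(eta + u, n) - A(eta, n))

def mgf_closed_form(u, n, p):
    """M_X(u) = (q + p e^u)^n, as in (5.30)."""
    return ((1 - p) + p * math.exp(u)) ** n

def central_moment(r, n, p):
    """E(X - np)^r by direct summation over x = 0, ..., n."""
    mean = n * p
    return sum((x - mean) ** r * binom_pmf(x, n, p) for x in range(n + 1))

n, p = 12, 0.3
q = 1 - p
```

With these definitions, `central_moment(2, n, p)`, `central_moment(3, n, p)`, and `central_moment(4, n, p)` can be compared with the right-hand sides of (5.31).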


Example 5.12 Poisson moments. A random variable $X$ has the Poisson distribution
$P(\lambda)$ if

(5.32)    $P(X = x) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x = 0, 1, \ldots, \quad \lambda > 0.$

Writing this as an exponential family in canonical form, we find

(5.33)    $\eta = \log \lambda, \quad A(\eta) = \lambda = e^{\eta}$

and hence

(5.34)    $K_X(u) = \lambda(e^u - 1), \quad M_X(u) = e^{\lambda(e^u - 1)},$

so that, in particular, $\kappa_r = \lambda$ for all $r$. The expectation and first three central
moments are given by (Problem 5.9)

(5.35)    $E(X) = \lambda, \qquad\ E(X - \lambda)^3 = \lambda,$
          $\mathrm{var}(X) = \lambda, \quad E(X - \lambda)^4 = \lambda + 3\lambda^2.$ ||
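
Since every Poisson cumulant equals $\lambda$, the conversion (5.22) from cumulants to moments can be tested against moments obtained by direct summation. The Python sketch below is an illustration under assumptions: the value $\lambda = 2.5$, the truncation point of the series, and the function names are chosen here and are not part of the text.

```python
import math

def raw_moments_from_cumulants(k1, k2, k3, k4):
    """First four moments about zero from the first four cumulants, as in (5.22)."""
    a1 = k1
    a2 = k2 + k1 ** 2
    a3 = k3 + 3 * k1 * k2 + k1 ** 3
    a4 = k4 + 3 * k2 ** 2 + 4 * k1 * k3 + 6 * k1 ** 2 * k2 + k1 ** 4
    return a1, a2, a3, a4

def poisson_expectation(f, lam, cutoff=200):
    """E f(X) for X ~ P(lam), by summing the series up to a cutoff (an approximation)."""
    total = 0.0
    term = math.exp(-lam)          # P(X = 0)
    for x in range(cutoff):
        total += f(x) * term
        term *= lam / (x + 1)      # P(X = x + 1) from P(X = x)
    return total

lam = 2.5
from_cumulants = raw_moments_from_cumulants(lam, lam, lam, lam)
by_summation = tuple(poisson_expectation(lambda x, r=r: x ** r, lam) for r in (1, 2, 3, 4))
third_central = poisson_expectation(lambda x: (x - lam) ** 3, lam)
fourth_central = poisson_expectation(lambda x: (x - lam) ** 4, lam)
```

The two tuples should agree to rounding error, and the central moments should match (5.35).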


Example 5.13 Normal moments. Let $X$ have the normal distribution $N(\xi, \sigma^2)$
with density

(5.36)    $\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \xi)^2 / 2\sigma^2}$

with respect to Lebesgue measure. For fixed $\sigma$, this is a one-parameter exponential
family with

(5.37)    $\eta = \xi/\sigma^2$ and $A(\eta) = \eta^2 \sigma^2 / 2 + \text{constant}.$

It is thus seen that

(5.38)    $M_X(u) = e^{\xi u + (1/2)\sigma^2 u^2}$

and hence in particular that

(5.39)    $E(X) = \xi.$

Since the distribution of $X - \xi$ is $N(0, \sigma^2)$, the central moments $\mu_r$ of $X$ are simply
the moments $\alpha_r$ of $N(0, \sigma^2)$, which are obtained from the moment generating
function

$M_0(u) = e^{\sigma^2 u^2 / 2}$

to be

(5.40)    $\mu_{2r+1} = 0, \quad \mu_{2r} = 1 \cdot 3 \cdots (2r - 1)\sigma^{2r}, \quad r = 1, 2, \ldots.$ ||


Example 5.14 Gamma moments. A random variable $X$ has the gamma distribution
$\Gamma(\alpha, b)$ if its density is

(5.41)    $\frac{1}{\Gamma(\alpha) b^{\alpha}}\, x^{\alpha - 1} e^{-x/b}, \quad x > 0, \quad \alpha > 0, \quad b > 0,$

with respect to Lebesgue measure on $(0, \infty)$. Here, $b$ is a scale parameter, whereas
$\alpha$ is called the shape parameter of the distribution. For $\alpha = f/2$ ($f$ an integer),
$b = 2$, this is the $\chi^2$-distribution $\chi^2_f$ with $f$ degrees of freedom. For fixed shape
parameter $\alpha$, (5.41) is a one-parameter exponential family with $\eta = -1/b$ and

$A(\eta) = \alpha \log b = -\alpha \log(-\eta).$

Thus, the moment and cumulant generating functions are seen to be

(5.42)    $M_X(u) = (1 - bu)^{-\alpha}$ and $K_X(u) = -\alpha \log(1 - bu), \quad u < 1/b.$

From the first of these formulas, one finds

(5.43)    $E(X^r) = \alpha(\alpha + 1) \cdots (\alpha + r - 1)\, b^r = \frac{\Gamma(\alpha + r)}{\Gamma(\alpha)}\, b^r$

and hence (Problem 5.17)

(5.44)    $E(X) = \alpha b, \qquad\quad E(X - \alpha b)^3 = 2\alpha b^3,$
          $\mathrm{var}(X) = \alpha b^2, \quad E(X - \alpha b)^4 = (3\alpha^2 + 6\alpha) b^4.$ ||
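
The step from the raw moments (5.43) to the central moments (5.44) uses only the standard binomial expansion of $(X - E X)^r$. A short Python sketch of that arithmetic follows; the parameter values and the function names are assumptions chosen here for illustration.

```python
import math

def gamma_raw_moment(r, alpha, b):
    """E(X^r) = Gamma(alpha + r) / Gamma(alpha) * b^r, as in (5.43)."""
    return math.gamma(alpha + r) / math.gamma(alpha) * b ** r

def central_from_raw(a1, a2, a3, a4):
    """Second, third, and fourth central moments from the first four raw moments."""
    mu2 = a2 - a1 ** 2
    mu3 = a3 - 3 * a1 * a2 + 2 * a1 ** 3
    mu4 = a4 - 4 * a1 * a3 + 6 * a1 ** 2 * a2 - 3 * a1 ** 4
    return mu2, mu3, mu4

alpha, b = 2.5, 1.5   # hypothetical shape and scale
raw = [gamma_raw_moment(r, alpha, b) for r in (1, 2, 3, 4)]
mu2, mu3, mu4 = central_from_raw(*raw)
```

The values `raw[0]`, `mu2`, `mu3`, and `mu4` can then be compared with $\alpha b$, $\alpha b^2$, $2\alpha b^3$, and $(3\alpha^2 + 6\alpha) b^4$.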

Another approach to moment calculations is to use an identity of Charles Stein,
which was given a thorough treatment by Hudson (1978). Stein's identity is primarily
used to establish minimaxity of estimators, but it is also useful in moment
calculations.

Lemma 5.15 (Stein's identity) If $X$ is distributed with density (5.2) and $g$ is any
differentiable function such that $E|g'(X)| < \infty$, then

(5.45)    $E\Big\{\Big[\frac{h'(X)}{h(X)} + \sum_{i=1}^{s} \eta_i T_i'(X)\Big] g(X)\Big\} = -E g'(X),$

provided the support of $X$ is $(-\infty, \infty)$. If the support of $X$ is the bounded interval
$(a, b)$, then (5.45) holds if $\exp\{\sum \eta_i T_i(x)\} h(x) \to 0$ as $x \to a$ or $b$.

The proof of the lemma is quite straightforward and is based on integration by
parts (Problem 5.18). We illustrate its use in the normal case.

Example 5.16 Stein's identity for the normal. If $X \sim N(\mu, \sigma^2)$, then (5.45)
becomes

$E\{g(X)(X - \mu)\} = \sigma^2 E g'(X).$

This immediately shows that $E(X) = \mu$ (take $g(x) = 1$) and $E(X^2) = \sigma^2 + \mu^2$ (take
$g(x) = x$). Higher-order moments are equally easy to calculate (Problem 5.18). ||
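
Stein's identity in the normal case can be checked by numerical integration. The Python sketch below uses a plain trapezoid rule over a truncated range; the integration range, step count, parameter values, and function names are all assumptions made here for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """N(mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def expect(f, mu, sigma, half_width=12.0, steps=20000):
    """Approximate E f(X) by the trapezoid rule on mu +/- half_width * sigma."""
    a = mu - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        x = a + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(x) * normal_pdf(x, mu, sigma)
    return total * h

def stein_sides(g, g_prime, mu, sigma):
    """Return both sides of E{g(X)(X - mu)} = sigma^2 E g'(X)."""
    lhs = expect(lambda x: g(x) * (x - mu), mu, sigma)
    rhs = sigma ** 2 * expect(g_prime, mu, sigma)
    return lhs, rhs

# g(x) = x^3 with mu = 1, sigma = 2.
lhs, rhs = stein_sides(lambda x: x ** 3, lambda x: 3 * x ** 2, 1.0, 2.0)
```

Taking $g(x) = x^3$ in this way relates $E(X^4)$ to $E(X^3)$ and $E(X^2)$, which is how the identity yields higher-order moments recursively.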

Not only are the moments of the statistics $T_i$ appearing in (5.1) and (5.2) of
interest but also the family of distributions of the $T$'s. This turns out again to be
an exponential family.

Theorem 5.17 If $X$ is distributed according to an exponential family with density
(5.1) with respect to a measure $\mu$ over $(\mathcal{X}, \mathcal{A})$, then $T = (T_1, \ldots, T_s)$ is distributed
according to an exponential family with density

(5.46)    $\exp\big[\sum \eta_i t_i - A(\eta)\big]\, k(t)$

with respect to a measure $\nu$ over $E_s$.

For a proof, see, for example, TSH2, Section 2.7, Lemma 8.

Let us now apply this theorem to the case of two independent exponential families
with densities (5.10). Then it follows from Theorem 5.17 that $(T_1 + U_1, \ldots, T_s +$
$U_s)$ is also distributed according to an $s$-dimensional exponential family, and by
induction, this result extends to the sum of more than two independent terms.
In particular, let $X_1, \ldots, X_n$ be independently distributed, each according to a
one-parameter exponential family with density

(5.47)    $\exp[\eta T_i(x_i) - A_i(\eta)]\, h_i(x_i).$

Then, the sum $\sum_{i=1}^{n} T_i(X_i)$ is again distributed according to a one-parameter
exponential family. In fact, the sum of independent Poisson or normal variables again
has a distribution of the same type, and the same is true for a sum of independent
binomial variables with common $p$, or a sum of independent gamma variables
$\Gamma(\alpha_i, b)$ with common $b$.

The normal distributions $N(\xi, \sigma^2)$ for fixed $\sigma$ constitute both a one-parameter
exponential family (Example 5.13) and a location family (Table 4.1). It is natural to
ask whether there are any other families that enjoy this double advantage. Another
example is obtained by putting $X = \log Y$ where $Y$ has the gamma distribution
$\Gamma(\alpha, b)$ given by (5.41), and where the location parameter $\theta$ is $\theta = \log b$. Since
multiplication of a random variable by a constant $c \neq 0$ preserves both the exponential
and location structure, a more general example is provided by the random
variable $c \log Y$ for any $c \neq 0$. It was shown by Dynkin (1951) and Ferguson (1962)
that the cases in which $X$ is normal or is equal to $c \log Y$, with $Y$ being gamma,
provide the only examples of exponential location families.

The $\Gamma(\alpha, b)$ distribution, with known parameter $\alpha$, constitutes an example of
an exponential scale family. Another example of an exponential scale family is
provided by the inverse Gaussian distribution (see Problem 5.22), which has been
extensively studied by Tweedie (1957). For a general treatment of these and other
results relating exponential and group families, see Barndorff-Nielsen et al. (1992)
or Barndorff-Nielsen (1988).

6 Sufficient Statistics 

The starting point of a statistical analysis, as formulated in the preceding sections,
is a random observable $X$ taking on values in a sample space $\mathcal{X}$, and a family of
possible distributions of $X$. It often turns out that some part of the data carries no
information about the unknown distribution and that $X$ can therefore be replaced by
some statistic $T = T(X)$ (not necessarily real-valued) without loss of information.
A statistic $T$ is said to be sufficient for $X$, or for the family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$
of possible distributions of $X$, or for $\theta$, if the conditional distribution of $X$ given
$T = t$ is independent of $\theta$ for all $t$.

This definition is not quite precise and we shall return to it later in this section.
However, consider first in what sense a sufficient statistic $T$ contains all the
information about $\theta$ contained in $X$. For that purpose, suppose that an investigator
reports the value of $T$, but on being asked for the full data, admits that they have
been discarded. In an effort at reconstruction, one can use a random mechanism
(such as a pseudo-random number generator) to obtain a random quantity $X'$
distributed according to the conditional distribution of $X$ given $t$. (This would not be
possible, of course, if the conditional distribution depended on the unknown $\theta$.)
Then the unconditional distribution of $X'$ is the same as that of $X$, that is,

$P_\theta(X' \in A) = P_\theta(X \in A)$ for all $A$,

regardless of the value of $\theta$. Hence, from a knowledge of $T$ alone, it is possible
to construct a quantity $X'$ which is completely equivalent to the original $X$. Since
$X$ and $X'$ have the same distribution for all $\theta$, they provide exactly the same
information about $\theta$ (for example, the estimators $\delta(X)$ and $\delta(X')$ have identical
distributions for any $\theta$).

In this sense, a sufficient statistic provides a reduction of the data without loss of
information. This property holds, of course, only as long as attention is restricted
to the model $\mathcal{P}$ and no distributions outside $\mathcal{P}$ are admitted as possibilities. Thus,
in particular, restriction to $T$ is not appropriate when testing the validity of $\mathcal{P}$.

The construction of $X'$ is, in general, effected with the help of an independent
random mechanism. An estimator $\delta(X')$ depends, therefore, not only on $T$ but
also on this mechanism. It is thus not an estimator as defined in Section 1, but
a randomized estimator. Quite generally, if $X$ is the basic random observable, a
randomized estimator of $g(\theta)$ is a rule which assigns to each possible outcome $x$ of
$X$ a random variable $Y(x)$ with a known distribution. When $X = x$, an observation
of $Y(x)$ will be taken and will constitute the estimate of $g(\theta)$. The risk, defined by
(1.10), of the resulting estimator is then

$\int_{\mathcal{X}} \Big[\int L(\theta, y)\, dP_{Y|X=x}(y)\Big]\, dP_{X|\theta}(x),$

where the probability measure in the inside integral does not depend on $\theta$. With this
representation, the operational significance of sufficiency can be formally stated
as follows.

Theorem 6.1 Let $X$ be distributed according to $P_\theta \in \mathcal{P}$ and let $T$ be sufficient
for $\mathcal{P}$. Then, for any estimator $\delta(X)$ of $g(\theta)$, there exists a (possibly randomized)
estimator based on $T$ which has the same risk function as $\delta(X)$.

Proof. Let $X'$ be constructed as above so that $\delta(X')$ is a (possibly randomized)
estimator depending on the data only through $T$. Since $\delta(X)$ and $\delta(X')$ have the
same distribution, they also have the same risk function. □


Example 6.2 Poisson sufficient statistic. Let $X_1, X_2$ be independent Poisson
variables with common expectation $\lambda$, so that their joint distribution is

$P(X_1 = x_1, X_2 = x_2) = \frac{\lambda^{x_1 + x_2}}{x_1!\, x_2!}\, e^{-2\lambda}.$

Then, the conditional distribution of $X_1$ given $X_1 + X_2 = t$ is given by

$P(X_1 = x_1 | X_1 + X_2 = t) = \frac{\lambda^t e^{-2\lambda} / x_1! (t - x_1)!}{\sum_{y=0}^{t} \lambda^t e^{-2\lambda} / y! (t - y)!}
= \frac{1}{x_1! (t - x_1)!} \Big(\sum_{y=0}^{t} \frac{1}{y! (t - y)!}\Big)^{-1}.$

Since this is independent of $\lambda$, so is the conditional distribution given $t$ of $(X_1, X_2 =$
$t - X_1)$, and hence $T = X_1 + X_2$ is a sufficient statistic for $\lambda$. To see how to
reconstruct $(X_1, X_2)$ from $T$, note that

$\sum_{y=0}^{t} \frac{1}{y! (t - y)!} = \frac{2^t}{t!},$

so that

$P(X_1 = x_1 | X_1 + X_2 = t) = \binom{t}{x_1} \Big(\frac{1}{2}\Big)^{x_1} \Big(\frac{1}{2}\Big)^{t - x_1},$

that is, the conditional distribution of $X_1$ given $t$ is the binomial distribution
$b(1/2, t)$ corresponding to $t$ trials with success probability 1/2. Let $X_1'$ and $X_2' =$
$t - X_1'$ be respectively the number of heads and the number of tails in $t$ tosses with
a fair coin. Then, the joint conditional distribution of $(X_1', X_2')$ given $t$ is the same
as that of $(X_1, X_2)$ given $t$. ||
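
The calculation in Example 6.2 can be reproduced numerically: the conditional probabilities computed from the Poisson joint distribution do not change with $\lambda$ and coincide with the $b(1/2, t)$ probabilities. The Python sketch below also includes a coin-tossing reconstruction; the function names and the particular values of $\lambda$ and $t$ are assumptions chosen here.

```python
import math
import random

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with expectation lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def conditional_pmf(x1, t, lam):
    """P(X1 = x1 | X1 + X2 = t), computed directly from the joint distribution."""
    joint = poisson_pmf(x1, lam) * poisson_pmf(t - x1, lam)
    marginal = sum(poisson_pmf(y, lam) * poisson_pmf(t - y, lam) for y in range(t + 1))
    return joint / marginal

def binomial_half_pmf(x1, t):
    """b(1/2, t) probability of x1 successes."""
    return math.comb(t, x1) * 0.5 ** t

def reconstruct(t, rng):
    """Rebuild a pair (X1', X2') from T = t by tossing a fair coin t times."""
    heads = sum(rng.random() < 0.5 for _ in range(t))
    return heads, t - heads

t = 6
table = {lam: [conditional_pmf(x1, t, lam) for x1 in range(t + 1)] for lam in (0.3, 2.0, 7.5)}
pair = reconstruct(t, random.Random(0))
```

Every row of `table` should equal the list of `binomial_half_pmf(x1, t)` values, whatever the value of `lam`.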

Example 6.3 Sufficient statistic for a uniform distribution. Let $X_1, \ldots, X_n$ be
independently distributed according to the uniform distribution $U(0, \theta)$. Let $T$ be
the largest of the $n$ $X$'s, and consider the conditional distribution of the remaining
$n - 1$ $X$'s given $t$. Thinking of the $n$ variables as $n$ points on the real line, it is
intuitively obvious and not difficult to see formally (Problem 6.2) that the remaining
$n - 1$ points (after the largest is fixed at $t$) behave like $n - 1$ points selected at
random from the interval $(0, t)$. Since this conditional distribution is independent
of $\theta$, $T$ is sufficient. Given only $T = t$, it is obvious how to reconstruct the original
sample: Select $n - 1$ points at random on $(0, t)$. ||

Example 6.4 Sufficient statistic for a symmetric distribution. Suppose that $X$ is
normally distributed with mean zero and unknown variance $\sigma^2$ (or more generally
that $X$ is symmetrically distributed about zero). Then, given that $|X| = t$, the only
two possible values of $X$ are $\pm t$, and by symmetry, the conditional probability of
each is 1/2. The conditional distribution of $X$ given $t$ is thus independent of $\sigma$ and
$T = |X|$ is sufficient. In fact, a random variable $X'$ with the same distribution as
$X$ can be obtained from $T$ by tossing a fair coin and letting $X' = T$ or $-T$ as the
coin falls heads or tails. ||

The definition of sufficiency given at the beginning of the section depends on
the concept of conditional probability, and this, unfortunately, is not capable of a
treatment which is both general and elementary. Difficulties arise when $P_\theta(T = t) =$
0, so that the conditioning event has probability zero. The definition of conditional
probability can then be changed at one or more values of $t$ (in fact, at any set of $t$
values which has probability zero) without affecting the distribution of $X$, which
is the result of combining the distribution of $T$ with the conditional distribution of
$X$ given $T$.

In elementary treatments of probability theory, the conditional probability $P(X \in$
$A|t)$ is considered for fixed $t$ as defining the conditional distribution of $X$ given
$T = t$. A more general approach can be obtained by a change of viewpoint, namely
by considering $P(X \in A|t)$ for fixed $A$ as a function of $t$, defined in such a way
that in combination with the distribution of $T$, it leads back to the distribution
of $X$. (See TSH2, Chapter 2, Section 4 for details.) This provides a justification,
for instance, of the assignment of conditional probabilities in Example 6.4 and
Example 6.10.

In the same way, the conditional expectation $\eta(t) = E[\delta(X)|t]$ can be defined in
such a way that

(6.1)    $E\eta(T) = E\delta(X),$

that is, so that the expected value of the conditional expectation is equal to the
unconditional expectation.

Conditional expectation essentially satisfies the usual laws of expectation. However,
since it is only determined up to sets of probability zero, these laws can only
hold a.e. More specifically, we have with probability 1

$E[af(X) + bg(X)|t] = aE[f(X)|t] + bE[g(X)|t]$

and

(6.2)    $E[b(T)f(X)|t] = b(t)E[f(X)|t].$

As just discussed, the functions $P(A|t)$ are not uniquely defined, and the question
arises whether determinations exist which, for each fixed $t$, define a conditional
probability. It turns out that this is not always possible. [See Romano and Siegel
(1986), who give an example due to Ash (1972). A more detailed treatment is
Blackwell and Ryll-Nardzewsky (1963).] It is possible when the sample space is
Euclidean, as will be the case throughout most of this book (see TSH2, Chapter
2, Section 5). When this is the case, a statistic $T$ can be defined to be sufficient if
there exists a determination of the conditional distribution functions of $X$ given $t$
which is independent of $\theta$.

The determination of sufficient statistics by means of the definition is incon¬ 
venient since it requires, first, guessing a statistic T that might be sufficient and, 
then, checking whether the conditional distributions of X given t is independent of 
6. However, for dominated families, that is, when the distributions have densities 
with respect to a common measure, there is a simple criterion for sufficiency. 

Theorem 6.5 (Factorization Criterion) A necessary and sufficient condition for
a statistic T to be sufficient for a family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ of distributions of X
dominated by a $\sigma$-finite measure $\mu$ is that there exist non-negative functions $g_\theta$
and h such that the densities $p_\theta$ of $P_\theta$ satisfy

(6.3) $p_\theta(x) = g_\theta[T(x)]\,h(x)$ (a.e. $\mu$).

Proof. See TSH2, Section 2.6, Theorem 8 and Corollary 1. 

Example 6.6 Continuation of Example 6.2. Suppose that $X_1, \ldots, X_n$ are iid
according to a Poisson distribution with expectation $\lambda$. Then

$P_\lambda(X_1 = x_1, \ldots, X_n = x_n) = \lambda^{\sum x_i} e^{-n\lambda} / \prod (x_i!)$.

This satisfies (6.3) with $T = \sum X_i$, which is therefore sufficient. ||
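The sufficiency of the sum in the Poisson case can also be seen directly from the definition: the conditional probability of a sample given its total should not change with the Poisson mean. The following brute-force sketch (function names and the sample point are hypothetical, chosen only for illustration) computes that conditional probability for two different means.

```python
from itertools import product
from math import exp, factorial


def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)


def joint_pmf(x, lam):
    p = 1.0
    for xi in x:
        p *= poisson_pmf(xi, lam)
    return p


def cond_prob(x, lam):
    """P(X = x | sum(X) = t) for iid Poisson(lam) coordinates, by brute force."""
    n, t = len(x), sum(x)
    # P(T = t): add the joint pmf over every vector with the same total.
    p_t = sum(joint_pmf(y, lam)
              for y in product(range(t + 1), repeat=n) if sum(y) == t)
    return joint_pmf(x, lam) / p_t


x = (2, 0, 1)
print(cond_prob(x, 0.5), cond_prob(x, 3.0))
```

Both calls return the same multinomial probability, $3!/(2!\,0!\,1!) \cdot 3^{-3} = 1/9$, which does not involve the Poisson mean.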
Example 6.7 Normal sufficient statistic. Let $X_1, \ldots, X_n$ be iid as $N(\xi, \sigma^2)$ so
that their joint density is

(6.4) $p_\theta(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[ -\frac{1}{2\sigma^2} \sum x_i^2 + \frac{\xi}{\sigma^2} \sum x_i - \frac{n}{2\sigma^2}\,\xi^2 \right]$.

Then it follows from the factorization criterion that $T = (\sum X_i, \sum X_i^2)$ is sufficient
for $\theta = (\xi, \sigma^2)$. Sometimes it is more convenient to replace T by the equivalent
statistic $T' = (\bar{X}, S^2)$ where $\bar{X} = \sum X_i / n$ and $S^2 = \sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$.
The two representations are equivalent in that they identify the same points of the
sample space, that is, $T(x) = T(y)$ if and only if $T'(x) = T'(y)$. ||
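The equivalence of the two representations amounts to each being computable from the other once n is known. A minimal sketch of the two conversions follows; the function names and the sample are hypothetical and serve only to illustrate the identities $\bar{X} = \sum X_i/n$ and $S^2 = \sum X_i^2 - n\bar{X}^2$.

```python
def T(x):
    """T = (sum of x_i, sum of x_i^2)."""
    return (sum(x), sum(xi * xi for xi in x))


def T_prime(x):
    """T' = (sample mean, sum of squared deviations from the mean)."""
    n = len(x)
    mean = sum(x) / n
    return (mean, sum((xi - mean) ** 2 for xi in x))


def T_prime_from_T(t, n):
    s1, s2 = t
    mean = s1 / n
    return (mean, s2 - n * mean ** 2)


def T_from_T_prime(tp, n):
    mean, ss = tp
    return (n * mean, ss + n * mean ** 2)


x = [1.0, 4.0, 2.5, 0.5]
print(T(x), T_prime(x))
print(T_prime_from_T(T(x), len(x)), T_from_T_prime(T_prime(x), len(x)))
```

Neither conversion looks at the individual observations, which is what "each is a function of the other" means here.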


Example 6.8 Continuation of Example 6.3. The joint density of a sample
$X_1, \ldots, X_n$ from $U(0, \theta)$ is

(6.5) $p_\theta(x) = \frac{1}{\theta^n} \prod_{i=1}^n I(0 < x_i)\, I(x_i < \theta)$

where the indicator function $I(\cdot)$ is defined in (2.6). Now

$\prod_{i=1}^n I(0 < x_i)\, I(x_i < \theta) = I(x_{(n)} < \theta) \prod_{i=1}^n I(0 < x_i)$

where $x_{(n)}$ is the largest of the x values. It follows from Theorem 6.5 that $X_{(n)}$ is
sufficient, as had been shown directly in Example 6.3. ||


As a final illustration, consider Example 6.4 from the present point of view. 

Example 6.9 Continuation of Example 6.4. If X is distributed as $N(0, \sigma^2)$, the
density of X is

$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$

which depends on x only through $x^2$, so that (6.3) holds with $T(x) = x^2$. As always,
of course, there are many equivalent statistics such as $|X|$, $X^4$, or $e^{X^2}$. ||

Quite generally, two statistics, T = T(X) and T' = T'(X), will be said to be
equivalent (with respect to a family $\mathcal{P}$ of distributions of X) if each is a function
of the other a.e. $\mathcal{P}$, that is, if there exists a $\mathcal{P}$-null set N and functions f and g
such that $T(x) = f[T'(x)]$ and $T'(x) = g[T(x)]$ for all $x \notin N$. Two such statistics
carry the same amount of information.

Example 6.10 Sufficiency of order statistics. Let $X = (X_1, \ldots, X_n)$ be iid according
to an unknown continuous distribution F and let $T = (X_{(1)}, \ldots, X_{(n)})$
where $X_{(1)} < \cdots < X_{(n)}$ denotes the ordered observations, the so-called order
statistics. By the continuity assumptions, the X's are distinct with probability 1.
Given T, the only possible values for X are the $n!$ vectors $(X_{(i_1)}, \ldots, X_{(i_n)})$, and
by symmetry, each of these has conditional probability $1/n!$. The conditional distribution
is thus independent of F, and T is sufficient. In fact, a random vector
X' with the same distribution as X can be obtained from T by labeling the n
coordinates of T at random. Equivalent to T is the statistic $U = (U_1, \ldots, U_n)$
where $U_1 = \sum X_i$, $U_2 = \sum X_i X_j$ ($i \neq j$), ..., $U_n = X_1 \cdots X_n$, and also the statistic
$V = (V_1, \ldots, V_n)$ where $V_k = X_1^k + \cdots + X_n^k$ (Problem 6.9). ||

Equivalent forms of a sufficient statistic reduce the data to the same extent. 
There may, however, also exist sufficient statistics which provide different degrees 
of reduction. 

Example 6.11 Different sufficient statistics. Let $X_1, \ldots, X_n$ be iid as $N(0, \sigma^2)$
and consider the statistics

$T_1(X) = (X_1, \ldots, X_n)$,

$T_2(X) = (X_1^2, \ldots, X_n^2)$,

$T_3(X) = (X_1^2 + \cdots + X_m^2,\; X_{m+1}^2 + \cdots + X_n^2)$,

$T_4(X) = X_1^2 + \cdots + X_n^2$.

These are all sufficient (Problem 6.5), with $T_i$ providing increasing reduction of
the data as i increases. ||


It follows from the interpretation of sufficiency given at the beginning of this 
section that if T is sufficient and T = H(U), then U is also sufficient. Knowledge 
of U implies knowledge of T and hence permits reconstruction of the original 
data. Furthermore, T provides a greater reduction of the data than U unless H 
is 1:1, in which case T and U are equivalent. A sufficient statistic T is said to 
be minimal if of all sufficient statistics it provides the greatest possible reduction 
of the data, that is, if for any sufficient statistic U there exists a function H such
that T = H(U) (a.e. $\mathcal{P}$). Minimal sufficient statistics can be shown to exist under
weak assumptions (see, for example, Bahadur, 1954), but exceptions are possible
(Pitcher 1957, Landers and Rogge 1972). Minimal sufficient statistics exist, in
particular, if the basic measurable space is Euclidean in the sense of Example 2.2
and the family $\mathcal{P}$ of distributions is dominated (Bahadur 1957).

It is typically fairly easy to construct a minimal sufficient statistic. For the sake 
of simplicity, we shall restrict attention to the case that the distributions of $\mathcal{P}$ all
have the same support (but see Problems 6.11-6.17).


Theorem 6.12 Let $\mathcal{P}$ be a finite family with densities $p_i$, $i = 0, 1, \ldots, k$, all having
the same support. Then, the statistic

(6.6) $T(X) = \left( \frac{p_1(X)}{p_0(X)}, \frac{p_2(X)}{p_0(X)}, \ldots, \frac{p_k(X)}{p_0(X)} \right)$

is minimal sufficient.


The proof is an easy consequence of the following corollary of Theorem 6.5
(Problem 6.6).

Corollary 6.13 Under the assumptions of Theorem 6.5, a necessary and sufficient
condition for a statistic U to be sufficient is that for any fixed $\theta$ and $\theta_0$, the ratio
$p_\theta(x)/p_{\theta_0}(x)$ is a function only of U(x).

Proof of Theorem 6.12. The corollary states that U is a sufficient statistic for $\mathcal{P}$ if
and only if T is a function of U, and this proves T to be minimal. □
Theorem 6.12 immediately extends to the case that $\mathcal{P}$ is countable. Generalizations
to uncountable families are also possible (see Lehmann and Scheffe 1950,
Dynkin 1951, and Barndorff-Nielsen, Hoffmann-Jorgensen, and Pedersen 1976),
but must contend with measure-theoretic difficulties. In most applications, minimal
sufficient statistics can be obtained for uncountable families by combining
Theorem 6.12 with the following lemma.

Lemma 6.14 If $\mathcal{P}$ is a family of distributions with common support and $\mathcal{P}_0 \subset \mathcal{P}$,
and if T is minimal sufficient for $\mathcal{P}_0$ and sufficient for $\mathcal{P}$, it is minimal sufficient
for $\mathcal{P}$.

Proof. If U is sufficient for $\mathcal{P}$, it is also sufficient for $\mathcal{P}_0$, and hence T is a function
of U. □

Example 6.15 Location families. As an application, let us now determine minimal
sufficient statistics for a sample $X_1, \ldots, X_n$ from a location family $\mathcal{P}$, that is,
when

(6.7) $p_\theta(x) = f(x_1 - \theta) \cdots f(x_n - \theta)$,

where f is assumed to be known. By Example 6.10, sufficiency permits the rather
trivial reduction to the order statistics for all f. However, this reduction uses only
the iid assumption and neither the special structure (6.7) nor the knowledge of f.
To illustrate the different possibilities that arise when this knowledge is utilized,
we shall take for f the six densities of Table 4.1, each with b = 1.

(i) Normal. If $\mathcal{P}_0$ consists of the two distributions $N(\theta_0, 1)$ and $N(\theta_1, 1)$, it
follows from Theorem 6.12 that the minimal sufficient statistic for $\mathcal{P}_0$ is
$T(X) = p_{\theta_1}(X)/p_{\theta_0}(X)$, which is equivalent to $\bar{X}$. Since $\bar{X}$ is sufficient for
$\mathcal{P} = \{N(\theta, 1), -\infty < \theta < \infty\}$ by the factorization criterion, it is minimal
sufficient.

(ii) Exponential. If the X's are distributed as $E(\theta, 1)$, it is easily seen that $X_{(1)}$ is
minimal sufficient (Problem 6.17).

(iii) Uniform. For a sample from $U(\theta - 1/2, \theta + 1/2)$, the minimal sufficient
statistic is $(X_{(1)}, X_{(n)})$ (Problem 6.16).

In these three instances, sufficiency was able to reduce the original n-dimensional
data to one or two dimensions. Such extensive reductions are not
possible for the remaining three distributions of Table 4.1.
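For the normal case (i), the reduction can be made concrete: the log of the likelihood ratio between two members of the family equals $n(\theta_1 - \theta_0)\bar{x} - n(\theta_1^2 - \theta_0^2)/2$, so it depends on the sample only through the mean. The sketch below checks this numerically; the function names and the two samples are hypothetical.

```python
from math import log, pi


def log_density(x, theta):
    """Log joint density of an iid N(theta, 1) sample."""
    return sum(-0.5 * log(2 * pi) - 0.5 * (xi - theta) ** 2 for xi in x)


def log_ratio(x, theta0, theta1):
    """Log of the likelihood ratio p_theta1(x) / p_theta0(x)."""
    return log_density(x, theta1) - log_density(x, theta0)


x = [0.0, 1.0, 5.0]
y = [2.0, 2.0, 2.0]   # a different sample with the same mean as x

print(log_ratio(x, 0.0, 1.0), log_ratio(y, 0.0, 1.0))
```

Two samples with equal means give the same ratio (here $3 \cdot 2 - 3/2 = 4.5$), even though their order statistics differ.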

(iv) Logistic. The joint density of a sample from $L(\theta, 1)$ is

(6.8) $p_\theta(x) = \exp\left[-\sum (x_i - \theta)\right] \Big/ \prod \{1 + \exp[-(x_i - \theta)]\}^2$.


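Consider a subfamily $\mathcal{P}_0$ consisting of the distributions (6.8) with $\theta_0 = 0$ and
$\theta_1, \ldots, \theta_k$. Then by Theorem 6.12, the minimal sufficient statistic for $\mathcal{P}_0$ is
$T(X) = [T_1(X), \ldots, T_k(X)]$ where

(6.9) $T_j(x) = e^{n\theta_j} \prod_{i=1}^n \left( \frac{1 + e^{-x_i}}{1 + e^{-x_i + \theta_j}} \right)^2$.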

We shall now show that for k = n + 1, T(X) is equivalent to the order statistics,
that is, that T(x) = T(y) if and only if $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ have
the same order statistics, which means that one is a permutation of the other. The
equation $T_j(x) = T_j(y)$ is equivalent to

$\prod_{i=1}^n \left( \frac{1 + \exp(-x_i)}{1 + \exp(-x_i + \theta_j)} \right)^2 = \prod_{i=1}^n \left( \frac{1 + \exp(-y_i)}{1 + \exp(-y_i + \theta_j)} \right)^2$

and hence T(x) = T(y) to

(6.10) $\prod_{i=1}^n \frac{1 + \xi u_i}{1 + u_i} = \prod_{i=1}^n \frac{1 + \xi v_i}{1 + v_i}$ for $\xi = \xi_1, \ldots, \xi_{n+1}$,

where $\xi_j = e^{\theta_j}$, $u_i = e^{-x_i}$, and $v_i = e^{-y_i}$. Now the left- and right-hand sides of
(6.10) are polynomials in $\xi$ of degree n which agree for n + 1 values of $\xi$ if and
only if the coefficients of $\xi^r$ agree for all r = 0, 1, ..., n. For r = 0, this implies
$\prod (1 + u_i) = \prod (1 + v_i)$, so that (6.10) reduces to $\prod (1 + \xi u_i) = \prod (1 + \xi v_i)$ for
$\xi = \xi_1, \ldots, \xi_{n+1}$, and hence for all $\xi$. It follows that $\prod (\eta + u_i) = \prod (\eta + v_i)$ for all
$\eta$, so that these two polynomials in $\eta$ have the same roots. Since this is equivalent
to the x's and y's having the same order statistics, the proof is complete.
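The logistic argument can be illustrated numerically. The sketch below evaluates the likelihood-ratio statistic (6.9) at n + 1 arbitrarily chosen nonzero parameter values for a sample, a permutation of it, and a different sample with the same sum. The samples, the parameter values, and the function names are assumptions made only for the illustration.

```python
from math import exp


def T_j(x, theta_j):
    """Statistic (6.9): the ratio p_theta_j(x) / p_0(x) in the logistic family."""
    value = exp(len(x) * theta_j)
    for xi in x:
        value *= ((1 + exp(-xi)) / (1 + exp(-xi + theta_j))) ** 2
    return value


def T(x, thetas):
    return [T_j(x, th) for th in thetas]


x = [0.3, -1.2, 2.0]
x_perm = [2.0, 0.3, -1.2]          # same order statistics as x
y = [0.3, -1.0, 1.8]               # same sum as x, different order statistics
thetas = [0.5, 1.0, 1.5, 2.0]      # n + 1 = 4 distinct nonzero values

print(T(x, thetas))
print(T(x_perm, thetas))
print(T(y, thetas))
```

The first two printed vectors agree up to rounding, while the third differs, consistent with T being equivalent to the order statistics rather than to a lower-dimensional summary such as the sum.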

Similar arguments show that in the Cauchy and double exponential cases, too, 
the order statistics are minimal sufficient (Problem 6.10). This is, in fact, the typical 
situation for location families, examples (i) through (iii) being happy exceptions. 


As a second application of Theorem 6.12 and Lemma 6.14, let us determine
minimal sufficient statistics for exponential families.

Corollary 6.16 (Exponential Families) Let X be distributed with density (5.2).
Then, $T = (T_1, \ldots, T_s)$ is minimal sufficient provided the family (5.2) satisfies one
of the following conditions:

(i) It is of full rank.

(ii) The parameter space contains s + 1 points $\eta^{(j)}$ (j = 0, ..., s), which span $E_s$,
in the sense that they do not belong to a proper affine subspace of $E_s$.

Proof. That T is sufficient follows immediately from Theorem 6.5. To prove minimality
under assumption (i), let $\mathcal{P}_0$ be a subfamily consisting of s + 1 distributions
$\eta^{(j)} = (\eta_1^{(j)}, \ldots, \eta_s^{(j)})$, j = 0, 1, ..., s. Then, the minimal sufficient statistic for $\mathcal{P}_0$
is equivalent to

$\sum_i (\eta_i^{(1)} - \eta_i^{(0)}) T_i(X), \; \ldots, \; \sum_i (\eta_i^{(s)} - \eta_i^{(0)}) T_i(X)$,

which is equivalent to $T = [T_1(X), \ldots, T_s(X)]$, provided the s x s matrix
$\| \eta_i^{(j)} - \eta_i^{(0)} \|$ is nonsingular. A subfamily $\mathcal{P}_0$ for which this condition is satisfied exists
under the assumption of full rank.

The proof of minimality under assumption (ii) is similar. □

It is seen from this result that the sufficient statistics T of Examples 6.6 and 6.7 
are minimal. The following example illustrates the applicability of part (ii). 

Example 6.17 Minimal sufficiency in curved exponential families. Let $X_1, X_2, \ldots, X_n$
have joint density (6.4), but, as in Example 5.4, assume that $\xi = \sigma$, so
the parameter space is the curve of Figure 10.1 (see Note 10.6). The statistic
$T = (\sum X_i, \sum X_i^2)$ is sufficient, and it is also minimal by Corollary 6.16. To see
this, recall that the natural parameter is $\eta = (1/\xi, -1/2\xi^2)$, and choose

$\eta^{(0)} = (1, -\tfrac{1}{2}), \quad \eta^{(1)} = (\tfrac{1}{2}, -\tfrac{1}{8}), \quad \eta^{(2)} = (\tfrac{1}{3}, -\tfrac{1}{18})$

and note that the 2 x 2 matrix

$\begin{pmatrix} \tfrac{1}{2} - 1 & \tfrac{1}{3} - 1 \\ -\tfrac{1}{8} + \tfrac{1}{2} & -\tfrac{1}{18} + \tfrac{1}{2} \end{pmatrix}$

has rank 2 and is invertible.

In contrast, suppose that the parameters are restricted according to $\xi = \sigma^2$,
another curved exponential family. This defines an affine subspace (with zero
curvature) and the sufficient statistic T is no longer minimal (Problem 6.20). ||
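The rank condition in this example can be checked with exact arithmetic. The sketch below assumes the three points on the curve $\xi = \sigma$ correspond to $\xi = 1, 2, 3$ (consistent with the natural parameter $(1/\xi, -1/2\xi^2)$), and contrasts them with three points taken on the restriction $\xi = \sigma^2$, where the first natural coordinate $\xi/\sigma^2$ is identically 1. The function names and the choice of points are hypothetical.

```python
from fractions import Fraction as F


def eta_curved(xi):
    """Natural parameter (1/xi, -1/(2 xi^2)) on the curve xi = sigma."""
    return (F(1) / xi, -F(1) / (2 * xi * xi))


def eta_affine(var):
    """Natural parameter (xi/sigma^2, -1/(2 sigma^2)) when xi = sigma^2 = var."""
    return (F(1), -F(1) / (2 * var))


def difference_det(points):
    """Determinant of the 2 x 2 matrix with columns eta^(j) - eta^(0), j = 1, 2."""
    e0, e1, e2 = points
    a, b = e1[0] - e0[0], e2[0] - e0[0]
    c, d = e1[1] - e0[1], e2[1] - e0[1]
    return a * d - b * c


det_curved = difference_det([eta_curved(F(k)) for k in (1, 2, 3)])
det_affine = difference_det([eta_affine(F(k)) for k in (1, 2, 3)])
print(det_curved, det_affine)
```

The first determinant is 1/36, so the three points span the plane; the second is 0, reflecting that all points of the restricted family lie on a line.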

Let $X_1, \ldots, X_n$ be iid, each with density (5.2), assumed to be of full rank.
Then, the joint distribution of the X's is again full-rank exponential, with
$T = (T_1^*, \ldots, T_s^*)$ where $T_i^* = \sum_{j=1}^n T_i(X_j)$ (i = 1, ..., s). This shows that in a sample from the
exponential family (5.2), the data can be reduced to an s-dimensional sufficient
statistic, regardless of the sample size.

The reduction of a sample to a smaller number of sufficient statistics greatly 
simplifies the statistical analysis, and it is therefore interesting to ask what other 
families permit such a reduction. The dimensionality of a sufficient statistic is a 
property which differs from those considered so far, in that it depends not only 
on the sets of points of the sample space for which the statistic takes on the same 
value but it also depends on these values; that is, the dimensionality may not be the 
same for different representations of a sufficient statistic (see, for example, Denny, 
1964, 1969). To make the concept of dimensionality meaningful, let us call T a
continuous s-dimensional sufficient statistic over a Euclidean sample space $\mathcal{X}$ if
the assumptions of Theorem 6.5 hold, if $T(x) = [T_1(x), \ldots, T_s(x)]$ where T is
continuous, and if the factorization (6.3) holds not only a.e. but for all $x \in \mathcal{X}$.

Theorem 6.18 Suppose $X_1, \ldots, X_n$ are real-valued iid according to a distribution
with density $f_\theta(x_i)$ with respect to Lebesgue measure, which is continuous in $x_i$
and whose support for all $\theta$ is an interval I. Suppose that for the joint density of
$X = (X_1, \ldots, X_n)$

$p_\theta(x) = f_\theta(x_1) \cdots f_\theta(x_n)$

there exists a continuous k-dimensional sufficient statistic. Then

(i) if k = 1, there exist functions $\eta_1$, B and h such that (5.1) holds;

(ii) if k > 1, and if the densities $f_\theta(x_i)$ have continuous partial derivatives with
respect to $x_i$, then there exist functions $\eta_i$, B and h such that (5.1) holds with
$s \le k$.

For a proof of this result, see Barndorff-Nielsen and Pedersen (1968). A corresponding
problem for the discrete case is considered by Andersen (1970a).

This theorem states essentially that among "smooth" absolutely continuous families
of distributions with fixed support, exponential families are the only ones that
permit dimensional reduction of the sample through sufficiency. It is crucial for
this result that the support of the distributions $P_\theta$ is independent of $\theta$. In the contrary
case, a simple example of a family possessing a one-dimensional sufficient
statistic for any sample size is provided by the uniform distribution (Example 6.3).

The Dynkin-Ferguson theorem mentioned at the end of the last section and
Theorem 6.18 state roughly that (a) the only location families which are
one-dimensional exponential families are the normal and log of gamma distributions
and (b) only exponential families permit reduction of the data through sufficiency.
Together, these results appear to say that the only location families with fixed 
support in which a dimensional reduction of the data is possible are the normal and 
log of gamma families. This is not quite correct, however, because a location family 
— although it is a one-dimensional family — may also be a curved exponential 
family. 


Example 6.19 Location/curved exponential family. Let $X_1, \ldots, X_n$ be iid with
joint density (with respect to Lebesgue measure)

(6.11) $C \exp\left[ -\sum_{i=1}^n (x_i - \theta)^4 \right] = C \exp(-n\theta^4) \exp\left( 4\theta^3 \sum x_i - 6\theta^2 \sum x_i^2 + 4\theta \sum x_i^3 - \sum x_i^4 \right)$.

According to (5.1), this is a three-dimensional exponential family, and it provides
an example of a location family with a three-dimensional sufficient statistic
satisfying all the assumptions of Theorem 6.18. This is a curved exponential family
with parameter space $\Theta = \{(\theta_1, \theta_2, \theta_3) : \theta_1 = \theta_3^3, \theta_2 = \theta_3^2\}$, a curved subset of
three-dimensional space. ||


The tentative conclusion, which had been reached just before Example 6.19 
and which was contradicted by this example, is nevertheless basically correct. 
Typically, a location family with fixed support $(-\infty, \infty)$ will not constitute even a
curved exponential family and will, therefore, not permit a dimensional reduction 
of the data without loss of information. 

Example 6.15 shows that the degree of reduction that can be achieved through 
sufficiency is extremely variable, and an interesting question is, what characterizes 
the situations in which sufficiency leads to a substantial reduction of the data? The 
ability of a sufficient statistic to achieve such a reduction appears to be related 
to the amount of ancillary information it contains. A statistic V(X) is said to
be ancillary if its distribution does not depend on $\theta$, and first-order ancillary if
its expectation $E_\theta[V(X)]$ is constant, independent of $\theta$. An ancillary statistic by
itself contains no information about $\theta$, but minimal sufficient statistics may still
contain much ancillary material. In Example 6.15(iv), for instance, the differences
$X_{(n)} - X_{(i)}$ (i = 1, ..., n - 1) are ancillary despite the fact that they are functions
of the minimal sufficient statistics $(X_{(1)}, \ldots, X_{(n)})$.

Example 6.20 Location ancillarity. Example 6.15(iv) is a particular case of a
location family. Quite generally, when sampling from any location family, the
differences $X_i - X_j$, $i \neq j$, are ancillary statistics. Similarly, when sampling from
scale families, ratios are ancillary. See Problem 6.34 for details. ||
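The reason differences are ancillary in a location family is that each observation can be written as the location parameter plus an error whose distribution is fixed, so the parameter cancels. The sketch below illustrates the cancellation with standard normal errors; that error distribution, the seed, and the function names are assumptions for the illustration, and any other fixed error density would behave the same way.

```python
import random


def location_sample(theta, n, seed):
    """X_i = theta + E_i with E_i drawn from a fixed error distribution."""
    rng = random.Random(seed)
    return [theta + rng.gauss(0.0, 1.0) for _ in range(n)]


def differences(x):
    """The differences X_i - X_1, i = 2, ..., n."""
    return [xi - x[0] for xi in x[1:]]


# Same errors (same seed), two different location parameters.
d_a = differences(location_sample(0.0, 5, seed=1))
d_b = differences(location_sample(10.0, 5, seed=1))
print(d_a)
print(d_b)
```

The two lists agree up to rounding: the differences are functions of the errors alone, so their distribution cannot depend on the location parameter.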
A sufficient statistic T appears to be most successful in reducing the data if
no nonconstant function of T is ancillary or even first-order ancillary, that is, if
$E_\theta[f(T)] = c$ for all $\theta \in \Omega$ implies $f(t) = c$ (a.e. $\mathcal{P}$). By subtracting c, this
condition is seen to be equivalent to

(6.12) $E_\theta[f(T)] = 0$ for all $\theta \in \Omega$ implies $f(t) = 0$ (a.e. $\mathcal{P}$)

where $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$. A statistic T satisfying (6.12) is said to be complete. As
will be seen later, completeness brings with it substantial simplifications of the 
statistical situation. 

Since complete sufficient statistics are particularly effective in reducing the data, 
it is not surprising that a complete sufficient statistic is always minimal. Proofs are 
given in Lehmann and Scheffe (1950), Bahadur (1957), and Schervish (1995); see 
also Problem 6.29. 

What happens to the ancillary statistics when the minimal sufficient statistic is 
complete is shown by the following result. 

Theorem 6.21 (Basu's Theorem) If T is a complete sufficient statistic for the
family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$, then any ancillary statistic V is independent of T.

Proof. If V is ancillary, the probability $p_A = P(V \in A)$ is independent of $\theta$
for all A. Let $\eta_A(t) = P(V \in A \mid T = t)$. Then, $E_\theta[\eta_A(T)] = p_A$ and, hence, by
completeness,

$\eta_A(t) = p_A$ (a.e. $\mathcal{P}$).

This establishes the independence of V and T. □
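A familiar instance of this theorem is a normal sample with known variance 1: the sample mean is complete sufficient for the mean parameter, the sum of squared deviations is ancillary, and the two are therefore independent. The simulation below is only a rough numerical sketch under that assumed model. It checks the sample correlation, which is a consequence of independence and not a proof of it; the sample size, replication count, and seed are arbitrary choices.

```python
import random


def simulate(theta, n=5, reps=20000, seed=0):
    """Simulate (sample mean, sum of squared deviations) for N(theta, 1) samples."""
    rng = random.Random(seed)
    means, spreads = [], []
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        m = sum(x) / n
        means.append(m)
        spreads.append(sum((xi - m) ** 2 for xi in x))
    return means, spreads


def correlation(a, b):
    """Sample correlation coefficient of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


means, spreads = simulate(theta=2.0)
r = correlation(means, spreads)
avg_spread = sum(spreads) / len(spreads)
print(r, avg_spread)
```

The correlation comes out close to zero, and the average of the ancillary statistic stays near n - 1 = 4 whatever value of the mean parameter is used.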

We conclude this section by examining some complete and incomplete families 
through examples. 

Theorem 6.22 If X is distributed according to the exponential family (5.2) and
the family is of full rank, then $T = [T_1(X), \ldots, T_s(X)]$ is complete.

For a proof, see TSH2 Section 4.3, Theorem 1; Barndorff-Nielsen (1978), 
Lemma 8.2.; or Brown (1986a), Theorem 2.12. 

Example 6.23 Completeness in some one-parameter families. We give some 
examples of complete one-parameter families of distributions. 

(i) Theorem 6.22 proves completeness of 

(a) X for the binomial family $\{b(p, n), 0 < p < 1\}$

(b) X for the Poisson family $\{P(\lambda), 0 < \lambda\}$

(ii) Uniform. Let $X_1, \ldots, X_n$ be iid according to the uniform distribution $U(0, \theta)$,
$0 < \theta$. It was seen in Example 6.3 that $T = X_{(n)}$ is sufficient for $\theta$. To see that
T is complete, note that

$P(T \le t) = t^n/\theta^n$, $0 < t < \theta$,

so that T has probability density

(6.13) $p_\theta(t) = n t^{n-1}/\theta^n$, $0 < t < \theta$.

Suppose $E_\theta f(T) = 0$ for all $\theta$, and let $f^+$ and $f^-$ be its positive and negative
parts, respectively. Then,

$\int_0^\theta t^{n-1} f^+(t)\,dt = \int_0^\theta t^{n-1} f^-(t)\,dt$

for all $\theta$. It follows that

$\int_A t^{n-1} f^+(t)\,dt = \int_A t^{n-1} f^-(t)\,dt$

for all Borel sets A, and this implies f = 0 a.e.

(iii) Exponential. Let $Y_1, \ldots, Y_n$ be iid according to the exponential distribution
$E(\eta, 1)$. If $X_i = e^{-Y_i}$ and $\theta = e^{-\eta}$, then $X_1, \ldots, X_n$ are iid as $U(0, \theta)$ (Problem
6.28), and it follows from (ii) that $X_{(n)}$ or, equivalently, $Y_{(1)}$ is sufficient and
complete. ||


Example 6.24 Completeness in some two-parameter families. 

(i) Normal $N(\xi, \sigma^2)$. Theorem 6.22 proves completeness of $(\bar{X}, S^2)$ of Example
6.7 in the normal family $\{N(\xi, \sigma^2), -\infty < \xi < \infty, 0 < \sigma\}$.

(ii) Exponential E(a, b). Let $X_1, \ldots, X_n$ be iid according to the exponential
distribution E(a, b), $-\infty < a < \infty$, $0 < b$, and let $T_1 = X_{(1)}$, $T_2 = \sum [X_i - X_{(1)}]$.
Then, $(T_1, T_2)$ are independently distributed as $E(a, b/n)$ and
$\frac{1}{2} b \chi^2_{2n-2}$, respectively (Problem 6.18), and they are jointly sufficient and complete.
Sufficiency follows from the factorization criterion. To prove completeness,
suppose that

$E_{a,b}[f(T_1, T_2)] = 0$ for all a, b.

Then if

(6.14) $g(t_1, b) = E_b[f(t_1, T_2)]$,

we have that for any fixed b,

$\int_a^\infty g(t_1, b)\, e^{-n t_1/b}\, dt_1 = 0$ for all a.

It follows from Example 6.23(iii) that

$g(t_1, b) = 0$,

except on a set $N_b$ of $t_1$ values which has Lebesgue measure zero and which
may depend on b. Then, by Fubini's theorem, for almost all $t_1$ we have

$g(t_1, b) = 0$ a.e. in b.

Since the densities of $T_2$ constitute an exponential family, $g(t_1, b)$ by (6.14)
is a continuous function of b for any fixed $t_1$. It follows that for almost all
$t_1$, $g(t_1, b) = 0$, not only a.e. but for all b. Applying completeness of $T_2$ to
(6.14), we see that for almost all $t_1$, $f(t_1, t_2) = 0$ a.e. in $t_2$. Thus, finally,
$f(t_1, t_2) = 0$ a.e. with respect to Lebesgue measure in the $(t_1, t_2)$ plane. [For
measurability aspects which have been ignored in this proof, see Lehmann
and Scheffe (1955, Theorem 7.1).] ||
Example 6.25 Minimal sufficient but not complete.

(i) Location uniform. Let $X_1, \ldots, X_n$ be iid according to $U(\theta - 1/2, \theta + 1/2)$,
$-\infty < \theta < \infty$. Here, $T = (X_{(1)}, X_{(n)})$ is minimal sufficient (Problem 6.16).
On the other hand, T is not complete since $X_{(n)} - X_{(1)}$ is ancillary. For example,
$E_\theta[X_{(n)} - X_{(1)} - (n - 1)/(n + 1)] = 0$ for all $\theta$.

(ii) Curved normal family. In the curved exponential family derived from the
$N(\xi, \sigma^2)$ family with $\xi = \sigma$, we have seen (Example 6.17) that the statistic
$T = (\sum X_i, \sum X_i^2)$ is minimal sufficient. However, it is not complete since
there exists a nonzero function f(T) with $E_\theta[f(T)] = 0$ for all $\theta$, so that (6.12)
fails. This follows from the fact that
we can find unbiased estimators for $\xi$ based on either $\sum X_i$ or $\sum X_i^2$ (see
Problem 6.21). ||

We close this section with an illustration of sufficiency and completeness in 
logit dose-response models. 

Example 6.26 Completeness in the logit model. For the model of Example 5.5,
where $X_i$ are independent $b(p_i, n_i)$, i = 1, ..., m, that is,

(6.15) $P(X_1 = x_1, \ldots, X_m = x_m) = \prod_{i=1}^m \binom{n_i}{x_i} p_i^{x_i} (1 - p_i)^{n_i - x_i}$,

it can be shown that $X = (X_1, \ldots, X_m)$ is minimal sufficient. The natural parameters
are the logits $\eta_i = \log[p_i/(1 - p_i)]$, i = 1, ..., m [see (5.8)], and if the $p_i$'s
are unrestricted, the minimal sufficient statistic is also complete (Problem 6.23). ||

Example 6.27 Dose-response model. Suppose $n_i$ subjects are each given dose
level $d_i$ of a drug, i = 1, 2, and that $d_1 < d_2$. The response of each subject is either
0 or 1, independent of the others, and the probability of a successful response is
$p_i = \eta_\theta(d_i)$. The joint distribution of the response vector $X = (X_1, X_2)$ is

(6.16) $p_\theta(x) = \prod_{i=1}^2 \binom{n_i}{x_i} [\eta_\theta(d_i)]^{x_i} [1 - \eta_\theta(d_i)]^{n_i - x_i}$.

Note the similarity to the model (6.15).

The statistic X is minimal sufficient in the model (6.16), and remains so if $\eta_\theta(d_i)$
has the form

(6.17) $\eta_\theta(d_i) = 1 - e^{-\theta d_i}$, $d_1 = 1$, $d_2 = 2$, $n_1 = 2$, $n_2 = 1$.

However, it is not complete since

(6.18) $E_\theta[I(X_1 = 0) - I(X_2 = 0)] = 0$.

If instead of (6.17), we assume that $\eta_\theta(d_i)$ is given by

(6.19) $\eta_\theta(d_i) = 1 - e^{-\theta_1 d_i - \theta_2 d_i^2}$, i = 1, 2,

where $d_1/d_2$ is an irrational number, then X is a complete sufficient statistic.

These models are special cases of those examined by Messig and Strawderman
(1993), who establish conditions for minimal sufficiency and completeness in a
large class of dose-response models. ||
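The failure of completeness under (6.17) rests on the identity $P(X_1 = 0) = (e^{-\theta})^2 = e^{-2\theta} = P(X_2 = 0)$ when two subjects receive dose 1 and one subject receives dose 2. The sketch below verifies (6.18) by enumerating the sample space for a few parameter values; the function names and the particular parameter values are hypothetical.

```python
from math import comb, exp

doses = (1, 2)   # d_1, d_2 as in (6.17)
sizes = (2, 1)   # n_1, n_2 as in (6.17)


def eta(theta, d):
    """Response probability 1 - exp(-theta * d) from (6.17)."""
    return 1.0 - exp(-theta * d)


def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def expected_contrast(theta):
    """E_theta[I(X1 = 0) - I(X2 = 0)], by enumeration over the sample space."""
    total = 0.0
    for x1 in range(sizes[0] + 1):
        for x2 in range(sizes[1] + 1):
            p = (binom_pmf(x1, sizes[0], eta(theta, doses[0]))
                 * binom_pmf(x2, sizes[1], eta(theta, doses[1])))
            total += p * ((x1 == 0) - (x2 == 0))
    return total


print([expected_contrast(t) for t in (0.1, 1.0, 2.5)])
```

The contrast is a nonzero function of X whose expectation vanishes (up to rounding) for every parameter value tried, which is exactly what rules out completeness.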
Table 7.1. Convex Functions

        Function $\phi$         Interval (a, b)

(i)     $|x|$                   $-\infty < x < \infty$
(ii)    $x^2$                   $-\infty < x < \infty$
(iii)   $x^p$, $p \ge 1$        $0 < x$
(iv)    $1/x^p$, $p > 0$        $0 < x$
(v)     $e^x$                   $-\infty < x < \infty$
(vi)    $-\log x$               $0 < x < \infty$


7 Convex Loss Functions 

The property of convexity and the associated property of concavity play an important
role in point estimation. In particular, the point estimation problem outlined
in Section 1 simplifies in a number of ways when the loss function $L(\theta, d)$ is a
convex function of d.

Definition 7.1 A real-valued function $\phi$ defined over an open interval I = (a, b)
with $-\infty \le a < b \le \infty$ is convex if for any a < x < y < b and any $0 < \gamma < 1$

(7.1) $\phi[\gamma x + (1 - \gamma) y] \le \gamma \phi(x) + (1 - \gamma) \phi(y)$.

The function is said to be strictly convex if strict inequality holds in (7.1) for all
indicated values of x, y, and $\gamma$. A function $\phi$ is concave on (a, b) if $-\phi$ is convex.

Convexity is a very strong condition which implies, for example, that $\phi$ is continuous
in (a, b) and has a left and right derivative at every point of (a, b). Proofs
of these properties and of the other properties of convex functions stated in the following
without proof can be found, for example, in Hardy, Littlewood, and Polya
(1934), Rudin (1966), Roberts and Varberg (1973), or Dudley (1989).

Determination of whether or not a function is convex is often easy with the help 
of the following two criteria. 

Theorem 7.2

(i) If $\phi$ is defined and differentiable on (a, b), then a necessary and sufficient
condition for $\phi$ to be convex is that

(7.2) $\phi'(x) \le \phi'(y)$ for all a < x < y < b.

The function is strictly convex if and only if the inequality (7.2) is strict for
all x < y.

(ii) If in addition, $\phi$ is twice differentiable, then the necessary and sufficient
condition (7.2) is equivalent to

(7.3) $\phi''(x) \ge 0$ for all a < x < b

with strict inequality sufficient (but not necessary) for strict convexity.

Example 7.3 Convex functions. From these criteria, it is easy to see that the
functions of Table 7.1 are convex over the indicated intervals. In all these cases, $\phi$
is strictly convex, except in (i) and in (iii) with p = 1.

In general, a convex function is strictly convex unless it is linear over some
subinterval of (a, b) (Problems 7.1 and 7.6).
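A quick numerical screen for convexity is to test the defining inequality (7.1) with weight 1/2 on a grid. The sketch below applies it to the six entries of Table 7.1, taking entry (vi) to be $-\log x$ (the logarithm itself is concave), with p = 3 in (iii) and p = 2 in (iv). The grid, tolerance, and names are assumptions; passing the check is a necessary condition on the grid, not a proof.

```python
from math import exp, log

# (function, lower end, upper end) of a finite test range inside each interval
table = {
    "|x|":    (abs,                  -5.0, 5.0),
    "x^2":    (lambda x: x * x,      -5.0, 5.0),
    "x^3":    (lambda x: x ** 3,      0.1, 5.0),   # case (iii) with p = 3
    "1/x^2":  (lambda x: 1 / x ** 2,  0.1, 5.0),   # case (iv) with p = 2
    "e^x":    (exp,                  -5.0, 5.0),
    "-log x": (lambda x: -log(x),     0.1, 5.0),
}


def midpoint_convex(phi, lo, hi, steps=60, tol=1e-12):
    """Check phi((x + y)/2) <= (phi(x) + phi(y))/2 for all grid pairs."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return all(phi((x + y) / 2) <= (phi(x) + phi(y)) / 2 + tol
               for x in grid for y in grid)


results = {name: midpoint_convex(phi, lo, hi)
           for name, (phi, lo, hi) in table.items()}
log_passes = midpoint_convex(log, 0.1, 5.0)
print(results, log_passes)
```

All six table entries pass, while the plain logarithm fails, as expected for a concave function.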

A basic property of convex functions is contained in the following theorem. 

Theorem 7.4 Let $\phi$ be a convex function defined on I = (a, b) and let t be any
fixed point in I. Then, there exists a straight line

(7.4) $y = L(x) = c(x - t) + \phi(t)$

through the point $[t, \phi(t)]$ such that

(7.5) $L(x) \le \phi(x)$ for all x in I.

By definition, a function $\phi$ is convex if the value of the function at the weighted
average of two points does not exceed the weighted average of its values at these
two points. By induction, this is easily generalized to the average of any finite
number of points (Problem 7.8). In fact, the inequality also holds for the weighted
average of any infinite set of points, and in this general form, it is known as Jensen's
inequality.

The weighted average of $\phi$ with respect to the weight function $\Lambda$ is represented
by

(7.6) $\int \phi \, d\Lambda$

where $\Lambda$ is a measure with $\Lambda(I) = 1$. In the particular case that $\Lambda$ assigns measure
$\gamma$ and $1 - \gamma$ to the points x and y, respectively, this reduces to the right side of
(7.1). It is convenient to interpret (7.6) as the expected value of $\phi(X)$, where X is
a random variable taking on values in I according to the probability distribution
$\Lambda$.

Theorem 7.5 (Jensen's Inequality) If $\phi$ is a convex function defined over an open
interval I, and X is a random variable with $P(X \in I) = 1$ and finite expectation,
then

(7.7) $\phi[E(X)] \le E[\phi(X)]$.

If $\phi$ is strictly convex, the inequality is strict unless X is a constant with probability
1.

Proof. Let y = L(x) be the equation of the line which satisfies (7.5) and for which
$L(t) = \phi(t)$ when t = E(X). Then,

(7.8) $E[\phi(X)] \ge E[L(X)] = L[E(X)] = \phi[E(X)]$,

which proves (7.7). If $\phi$ is strictly convex, the inequality in (7.5) is strict for all
$x \neq t$, and hence the inequality in (7.8) is strict unless $\phi(X) = E[\phi(X)]$ with
probability 1. □

Note that the theorem does not exclude the possibility that $E[\phi(X)] = \infty$.
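Jensen's inequality is easy to verify exactly for a small discrete distribution. The sketch below uses a hypothetical three-point distribution on the positive reals and two convex functions from Table 7.1, $x^2$ and $1/x$, with rational arithmetic so that both sides can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical distribution: P(X = 1) = 1/2, P(X = 2) = 1/3, P(X = 6) = 1/6.
values = [F(1), F(2), F(6)]
probs = [F(1, 2), F(1, 3), F(1, 6)]


def expectation(g):
    """E[g(X)] under the distribution above."""
    return sum(p * g(v) for v, p in zip(values, probs))


mean = expectation(lambda v: v)

square_of_mean = mean ** 2                    # phi(E X) with phi(x) = x^2
mean_of_square = expectation(lambda v: v * v)  # E phi(X)

recip_of_mean = 1 / mean                      # phi(E X) with phi(x) = 1/x
mean_of_recip = expectation(lambda v: 1 / v)  # E phi(X)

print(square_of_mean, mean_of_square)
print(recip_of_mean, mean_of_recip)
```

Since X is not constant and both functions are strictly convex on the positive reals, both inequalities are strict: 169/36 < 47/6 and 6/13 < 25/36.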
Corollary 7.6 If X is a nonconstant positive random variable with finite expectation,
then

(7.9) $\frac{1}{E(X)} < E\left( \frac{1}{X} \right)$

and

(7.10) $E(\log X) < \log[E(X)]$.

Example 7.7 Entropy distance. For density functions f and g, we define the
entropy distance between f and g, with respect to f (also known as Kullback-
Leibler Information of g at f or Kullback-Leibler distance between g and f)
as

(7.11) $E_f[\log(f(X)/g(X))] = \int \log[f(x)/g(x)]\, f(x)\, dx$.

Corollary 7.6 shows that

$E_f[\log(f(X)/g(X))] = -E_f[\log(g(X)/f(X))]$

(7.12) $\ge -\log[E_f(g(X)/f(X))]$

$= 0$,

and hence that the entropy distance is always non-negative, and equals zero if
f = g. Note that inequality (7.12) also establishes

(7.13) $E_f \log[g(X)] \le E_f \log[f(X)]$,

which plays an important role in the theory of the EM algorithm of Section 6.4.

Entropy distance was explored by Kullback (1968); for an exposition of its
properties see, for example, Brown (1986a). Entropy distance has, more recently,
found many uses in Bayesian analysis; see, e.g., Berger (1985) or Robert (1994a),
and Section 4.5. ||
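For probability mass functions on a finite set, the integral in (7.11) becomes a sum, and the properties stated in the example can be checked directly. The sketch below uses that discrete analogue; the two distributions and the function names are hypothetical, and both distributions are assumed to have the same support so that the logarithms are finite.

```python
from math import log


def entropy_distance(f, g):
    """Discrete analogue of (7.11): sum over x of f(x) * log(f(x) / g(x))."""
    return sum(fx * log(fx / gx) for fx, gx in zip(f, g) if fx > 0)


def expected_log(f, g):
    """E_f log g(X) for pmfs given as lists over a common finite support."""
    return sum(fx * log(gx) for fx, gx in zip(f, g) if fx > 0)


f = [0.5, 0.3, 0.2]
g = [0.2, 0.5, 0.3]

print(entropy_distance(f, g), entropy_distance(g, f), entropy_distance(f, f))
print(expected_log(f, g), expected_log(f, f))
```

The distance is positive for distinct distributions, zero when the two coincide, and not symmetric in its arguments; the last line illustrates inequality (7.13).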

In Theorem 6.1, it was seen that if T is a sufficient statistic, then for any statistical 
procedure there exists an equivalent procedure (i.e., having the same risk function) 
based only on T. We shall now show that in estimation with a strictly convex loss 
function, a much stronger statement is possible: Given any estimator $\delta(X)$ which
is not a function of T, there exists a better estimator depending only on T.

Theorem 7.8 (Rao-Blackwell Theorem) Let X be a random observable with
distribution $P_\theta \in \mathcal{P} = \{P_{\theta'}, \theta' \in \Omega\}$, and let T be sufficient for $\mathcal{P}$. Let $\delta$ be an
estimator of an estimand $g(\theta)$, and let the loss function $L(\theta, d)$ be a strictly convex
function of d. Then, if $\delta$ has finite expectation and risk,

$R(\theta, \delta) = E L[\theta, \delta(X)] < \infty$,

and if

(7.14) $\eta(t) = E[\delta(X) \mid t]$,

the risk of the estimator $\eta(T)$ satisfies

(7.15) $R(\theta, \eta) < R(\theta, \delta)$

unless $\delta(X) = \eta(T)$ with probability 1.
Proof. In Theorem 7.5, let $\phi(d) = L(\theta, d)$, let $\delta = \delta(X)$, and let X have the
conditional distribution $P^{X \mid t}$ of X given T = t. Then

$L[\theta, \eta(t)] < E\{L[\theta, \delta(X)] \mid t\}$

unless $\delta(X) = \eta(T)$ with probability 1. Taking the expectation on both sides of this
inequality yields (7.15), unless $\delta(X) = \eta(T)$ with probability 1. □

Some points concerning this result are worth noting. 

1. Sufficiency of T is used in the proof only to ensure that $\eta(T)$ does not depend
on $\theta$ and hence is an estimator.

2. If the loss function is convex but not strictly convex, the theorem remains true
provided the inequality sign in (7.15) is replaced by $\le$. Even in that case, the
theorem still provides information beyond the results of Section 6 because it
shows that the particular estimator $\eta(T)$ is at least as good as $\delta(X)$.

3. The theorem is not true if the convexity assumption is dropped. Examples 
illustrating this fact will be given in Chapters 2 and 5. 
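The improvement promised by the Rao-Blackwell theorem can be computed exactly in a small case. The sketch below assumes iid Bernoulli(p) observations, squared error loss (which is strictly convex), the crude estimator that uses only the first observation, and the sufficient statistic T equal to the sum; by symmetry the conditional expectation of the first observation given T is T/n. The model, the parameter value, and the names are choices made for the illustration.

```python
from fractions import Fraction as F
from itertools import product


def risks(p, n):
    """Exact squared-error risks of delta(X) = X_1 and eta(T) = T/n."""
    risk_delta = risk_eta = F(0)
    for x in product((0, 1), repeat=n):
        t = sum(x)
        prob = p ** t * (1 - p) ** (n - t)          # probability of this sample
        risk_delta += prob * (x[0] - p) ** 2        # loss of the crude estimator
        risk_eta += prob * (F(t, n) - p) ** 2       # loss after conditioning on T
    return risk_delta, risk_eta


p, n = F(1, 3), 4
risk_delta, risk_eta = risks(p, n)
print(risk_delta, risk_eta)
```

The risks come out as p(1 - p) = 2/9 and p(1 - p)/n = 1/18, so conditioning on the sufficient statistic reduces the risk by the factor n in this example.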

In Section 6, randomized estimators were introduced, and such estimators may 
be useful, for example, in reducing the maximum risk (see Chapter 5, Example 
5.1.8), but this can never be the case when the loss function is convex. 

Corollary 7.9 Given any randomized estimator of $g(\theta)$, there exists a nonrandomized
estimator which is uniformly better if the loss function is strictly convex
and at least as good when it is convex.

Proof. Note first that a randomized estimator can be obtained as a nonrandomized
estimator $\delta^*(X, U)$, where X and U are independent and U is uniformly distributed
on (0, 1). This is achieved by observing X = x and then using U to construct the
distribution of Y given X = x, where Y = Y(x) is the random variable employed in
the definition of a randomized estimator (Problem 7.10). To prove the theorem, we
therefore need to show that given any estimator $\delta^*(X, U)$ of $g(\theta)$, there exists an
estimator $\delta(X)$, depending on X only, which has uniformly smaller risk. However,
this is an immediate consequence of the Rao-Blackwell theorem since for the
observations (X, U), the statistic X is sufficient. For $\delta(X)$, one can therefore take
the conditional expectation of $\delta^*(X, U)$ given X. □

An estimator $\delta$ is said to be inadmissible if there exists another estimator $\delta'$ which
dominates it (that is, such that $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$, with strict inequality
for some $\theta$) and admissible if no such estimator $\delta'$ exists. If the loss function $L$
is strictly convex, it follows from Corollary 7.9 that every admissible estimator
must be nonrandomized. Another property of admissible estimators in the strictly
convex loss case is provided by the following uniqueness result.

Theorem 7.10 If $L$ is strictly convex and $\delta$ is an admissible estimator of $g(\theta)$, and
if $\delta'$ is another estimator with the same risk function, that is, satisfying
$R(\theta, \delta) = R(\theta, \delta')$ for all $\theta$, then $\delta' = \delta$ with probability 1.

Proof. If $\delta^* = \frac{1}{2}(\delta + \delta')$, then

(7.16) $R(\theta, \delta^*) < \frac{1}{2}[R(\theta, \delta) + R(\theta, \delta')] = R(\theta, \delta)$

unless $\delta = \delta'$ with probability 1, and (7.16) contradicts the admissibility of $\delta$. □

The preceding considerations can be extended to the situation in which the
estimand $g(\theta) = [g_1(\theta), \ldots, g_k(\theta)]$ and the estimator $\delta(X) = [\delta_1(X), \ldots, \delta_k(X)]$
are vector-valued.

Definition 7.11 For any two points $\mathbf{x} = (x_1, \ldots, x_k)$ and $\mathbf{y} = (y_1, \ldots, y_k)$ in $E_k$,
define $\gamma \mathbf{x} + (1 - \gamma)\mathbf{y}$ to be the point with coordinates $\gamma x_i + (1 - \gamma) y_i$, $i = 1, \ldots, k$.

(i) A set $S$ in $E_k$ is convex if for any $\mathbf{x}, \mathbf{y} \in S$, the points

$\gamma \mathbf{x} + (1 - \gamma)\mathbf{y}, \quad 0 < \gamma < 1,$

are also in $S$. (Geometrically, this means that the line segment connecting any
two points in $S$ lies in $S$.)

(ii) A real-valued function $\phi$ defined over an open convex set $S$ in $E_k$ is convex
if (7.1) holds with $x$ and $y$ replaced by $\mathbf{x}$ and $\mathbf{y}$; it is strictly convex if the
inequality is strict for all $\mathbf{x}$ and $\mathbf{y}$.

Example 7.12 Convex combination. If $\phi_j$ is a convex function of a real variable
defined over an interval $I_j$ for each $j = 1, \ldots, k$, then for any positive constants
$a_1, \ldots, a_k$,

(7.17) $\phi(\mathbf{x}) = \sum_j a_j \phi_j(x_j)$

is a convex function defined over the $k$-dimensional rectangle with sides $I_1, \ldots, I_k$;
it is strictly convex, provided $\phi_1, \ldots, \phi_k$ are all strictly convex. This example
implies, in particular, that the loss function

(7.18) $L(\theta, d) = \sum_i a_i [d_i - g_i(\theta)]^2$

is strictly convex. ||

A useful criterion to determine whether a given function $\phi$ is convex is the
following generalization of (7.3).

Theorem 7.13 Let $\phi$ be defined over an open convex set $S$ in $E_k$ and twice differentiable
in $S$. Then, a necessary and sufficient condition for $\phi$ to be convex is that
the $k \times k$ matrix with $ij$th element $\partial^2 \phi(x_1, \ldots, x_k)/\partial x_i \partial x_j$, which is known as the
Hessian matrix, is positive semidefinite; if the matrix is positive definite, then $\phi$ is
strictly convex.

Example 7.14 Quadratic loss. Consider the loss function

(7.19) $L(\theta, d) = \sum_i \sum_j a_{ij} [d_i - g_i(\theta)][d_j - g_j(\theta)].$

Since $\partial^2 L/\partial d_i \partial d_j = a_{ij}$, $L$ is strictly convex, provided the matrix $\|a_{ij}\|$ is positive
definite. ||
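
The criterion of Theorem 7.13 can be checked numerically for the quadratic loss (7.19). The sketch below tests positive definiteness of a symmetric matrix by attempting a Cholesky factorization, which is one standard way to do so; the helper names and the two example matrices are assumptions introduced here for illustration.

```python
import math

def is_positive_definite(a, tol=1e-12):
    """Return True if the symmetric matrix a (a list of rows) is positive definite.

    A Cholesky factorization a = l l^T with strictly positive pivots exists
    exactly when a is positive definite.
    """
    k = len(a)
    l = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(l[i][m] * l[j][m] for m in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= tol:
                    return False
                l[i][j] = math.sqrt(pivot)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return True

def quadratic_loss(a, d, g):
    """L(theta, d) = sum_i sum_j a_ij [d_i - g_i][d_j - g_j], as in (7.19)."""
    e = [di - gi for di, gi in zip(d, g)]
    return sum(a[i][j] * e[i] * e[j] for i in range(len(e)) for j in range(len(e)))

a_good = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 1 and 3: positive definite
a_bad = [[1.0, 2.0], [2.0, 1.0]]    # eigenvalues -1 and 3: indefinite
print(is_positive_definite(a_good), is_positive_definite(a_bad))
```

For the indefinite matrix, the loss at the midpoint of $d = (1, -1)$ and $d = (-1, 1)$ exceeds the average of the losses at the two endpoints, so convexity fails there.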

Let us now consider some consequences of adopting a convex loss function in a
location model. In Section 1, it was pointed out that there exists a unique number
$a$ minimizing $\sum (x_i - a)^2$, namely $\bar{x}$, and that the minimizing value of $\sum_{i=1}^n |x_i - a|$
is either unique (when $n$ is odd) or the minimizing values constitute an interval.
This interval structure of the minimizing values does not hold, for example, when
minimizing $\sum \sqrt{|x_i - a|}$. In the case $n = 2$, for instance, there exist two minimizing
values, $a = x_1$ and $a = x_2$ (Problem 7.12). This raises the general question of the
set of values $a$ minimizing $\sum \rho(x_i - a)$, which, in turn, is a special case of the
following problem. Let $X$ be a random variable and $L(\theta, d) = \rho(d - \theta)$ a loss
function, with $\rho$ even. Then, what can be said about the set of values $a$ minimizing
$E[\rho(X - a)]$? This specializes to the earlier case if $X$ takes on the values $x_1, \ldots, x_n$
with probabilities $1/n$ each.
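
The three minimization problems just described can be compared by a simple grid search in the case $n = 2$. The following Python sketch is only an illustration; the data $x_1 = 0$, $x_2 = 1$, the grid, and the helper name `argmin_on_grid` are assumptions made here.

```python
def argmin_on_grid(rho, xs, grid, tol=1e-9):
    """Grid points a at which sum_i rho(x_i - a) attains its minimum over the grid."""
    totals = [sum(rho(x - a) for x in xs) for a in grid]
    best = min(totals)
    return [a for a, v in zip(grid, totals) if v <= best + tol]

xs = [0.0, 1.0]                               # the case n = 2
grid = [i / 100 for i in range(-50, 151)]     # candidate values of a in [-0.5, 1.5]

squared = argmin_on_grid(lambda t: t * t, xs, grid)
absolute = argmin_on_grid(abs, xs, grid)
root = argmin_on_grid(lambda t: abs(t) ** 0.5, xs, grid)

print(squared)                      # a single minimizer, the sample mean
print(absolute[0], absolute[-1])    # endpoints of an interval of minimizers
print(root)                         # two isolated minimizers, x_1 and x_2
```

The convex losses produce a point or an interval of minimizers, whereas the nonconvex square-root loss produces the two separated values $a = x_1$ and $a = x_2$.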

Theorem 7.15 Let $\rho$ be a convex function defined on $(-\infty, \infty)$ and $X$ a random
variable such that $\phi(a) = E[\rho(X - a)]$ is finite for some $a$. If $\rho$ is not monotone,
$\phi(a)$ takes on its minimum value and the set on which this value is taken is a closed
interval. If $\rho$ is strictly convex, the minimizing value is unique.

The proof is based on the following lemma.

Lemma 7.16 Let $\phi$ be a convex function on $(-\infty, \infty)$ which is bounded below
and suppose that $\phi$ is not monotone. Then, $\phi$ takes on its minimum value; the set
$S$ on which this value is taken on is a closed interval and is a single point when $\phi$
is strictly convex.

Proof. Since $\phi$ is convex and not monotone, it tends to $\infty$ as $x \to \pm\infty$. Since $\phi$
is also continuous, it takes on its minimizing value. That $S$ is an interval follows
from convexity and that it is closed follows from continuity. □

Proof of Theorem 7.15. By the lemma, it is enough to prove that $\phi$ is (strictly)
convex and not monotone. That $\phi$ is not monotone follows from the fact that
$\phi(a) \to \infty$ as $a \to \pm\infty$. This latter property of $\phi$ is a consequence of the facts
that $X - a$ tends in probability to $\mp\infty$ as $a \to \pm\infty$ and that $\rho(t) \to \infty$ as
$t \to \pm\infty$. (Strict) convexity of $\phi$ follows from the corresponding property of $\rho$. □

Example 7.17 Squared error loss. Let $\rho(t) = t^2$ and suppose that $E(X^2) < \infty$.
Since $\rho$ is strictly convex, it follows that $\phi(a)$ has a unique minimizing value. If
$E(X) = \mu$, which by assumption is finite, we have, in fact,

(7.20) $\phi(a) = E(X - a)^2 = E(X - \mu)^2 + (\mu - a)^2,$

which shows that $\phi(a)$ is a minimum if and only if $a = \mu$. ||

Example 7.18 Absolute error loss. Let $\rho(t) = |t|$ and suppose that $E|X| < \infty$.
Since $\rho$ is convex but not strictly convex, it follows from Theorem 7.15 that $\phi(a)$
takes on its minimum value and that the set $S$ of minimizing values is a closed
interval. The set $S$ is, in fact, the set of medians of $X$ (Problems 1.7 and 1.8). ||
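
Examples 7.17 and 7.18 can be illustrated with a small discrete distribution. The sketch below is a minimal check, not part of the text: the four-point distribution is hypothetical and was chosen so that $P(X \le 1) = P(X \ge 2) = 1/2$, which makes every point of $[1, 2]$ a median.

```python
def phi(rho, values, probs, a):
    """phi(a) = E[rho(X - a)] for a discrete X with the given values and probabilities."""
    return sum(p * rho(x - a) for x, p in zip(values, probs))

values = [0.0, 1.0, 2.0, 10.0]
probs = [0.2, 0.3, 0.3, 0.2]          # hypothetical distribution, for illustration only
mu = sum(p * x for x, p in zip(values, probs))
var = phi(lambda t: t * t, values, probs, mu)

# Squared error: check the decomposition (7.20) at a few values of a.
for a in (-1.0, 0.0, mu, 4.0):
    lhs = phi(lambda t: t * t, values, probs, a)
    rhs = var + (mu - a) ** 2
    assert abs(lhs - rhs) < 1e-9

# Absolute error: locate the minimizers of E|X - a| on a grid.
grid = [i / 10 for i in range(-20, 121)]
risks = [phi(abs, values, probs, a) for a in grid]
best = min(risks)
minimizers = [a for a, r in zip(grid, risks) if r <= best + 1e-9]
print(mu, minimizers[0], minimizers[-1])
```

The squared error risk is minimized only at the mean, while the absolute error risk is constant on the whole interval of medians.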

The following is a useful consequence of Theorem 7.15 (see also Problem 7.27). 

Corollary 7.19 Under the assumptions of Theorem 7.15, suppose that $\rho$ is even
and $X$ is symmetric about $\mu$. Then, $\phi(a)$ attains its minimum at $a = \mu$.

Proof. By Theorem 7.15 the minimum is taken on. If $\mu + c$ is a minimizing value,
so is $\mu - c$ and so, therefore, are all values $a$ between $\mu - c$ and $\mu + c$, which
includes $a = \mu$. □

Now consider an example in which $\rho$ is not convex.

Example 7.20 Nonconvex loss. Let $\rho(t) = 1$ if $|t| \ge k$ and $\rho(t) = 0$ otherwise.
Minimizing $\phi(a)$ is then equivalent to maximizing $\psi(a) = P(|X - a| < k)$.
Consider the following two special cases (Problem 7.22):

(i) The distribution of $X$ has a probability density $f$ (with respect to Lebesgue
measure) which is continuous, unimodal, and such that $f(x)$ decreases strictly
as $x$ moves away from the mode in either direction. Then, there exists a unique
value $a$ for which $f(a - k) = f(a + k)$, and this is the unique maximizing
value of $\psi(a)$.

(ii) Suppose that $f$ is even and U-shaped with $f(x)$ attaining its maximum at
$x = \pm A$ and $f(x) = 0$ for $|x| > A$. Then, $\psi(a)$ attains its maximum at the
two points $a = -A + k$ and $a = A - k$. ||
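
Case (ii) can be made concrete with a specific U-shaped density. The sketch below assumes, purely for illustration, the density $f(x) = 3x^2/(2A^3)$ on $[-A, A]$, which is even and largest at $x = \pm A$; the function names and the values $A = 1$, $k = 0.25$ are likewise hypothetical.

```python
def cdf_u_shaped(x, big_a):
    """cdf of the density f(x) = 3 x^2 / (2 A^3) on [-A, A], zero elsewhere."""
    x = max(-big_a, min(big_a, x))      # the cdf is flat outside [-A, A]
    return (x ** 3 + big_a ** 3) / (2 * big_a ** 3)

def psi(a, k, big_a):
    """psi(a) = P(|X - a| < k) for X with the density above."""
    return cdf_u_shaped(a + k, big_a) - cdf_u_shaped(a - k, big_a)

big_a, k = 1.0, 0.25
grid = [i / 100 for i in range(-150, 151)]
values = [psi(a, k, big_a) for a in grid]
best = max(values)
maximizers = [a for a, v in zip(grid, values) if v >= best - 1e-12]
print(maximizers)   # the two points -A + k and A - k
```

The set of best values of $a$ consists of two separated points, so the interval structure guaranteed by Theorem 7.15 for convex $\rho$ is lost.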

Convex loss functions have been seen to lead to a number of simplifications of 
estimation problems. One may wonder, however, whether such loss functions are 
likely to be realistic. If $L(\theta, d)$ represents not just a measure of inaccuracy but a
real (for example, financial) loss, one may argue that all such losses are bounded:
once you have lost all, you cannot lose any more. On the other hand, if $d$ can take on
all values in $(-\infty, \infty)$ or $(0, \infty)$, no nonconstant bounded function can be convex
(Problem 7.18). Unfortunately, bounded loss functions with unbounded $d$ can lead
to completely unreasonable estimators (see, for example, Theorem 2.1.15). The
reason is roughly that arbitrarily large errors can then be committed with essentially 
no additional penalty and their leverage used to unfair advantage. Perhaps convex 
loss functions result in more reasonable estimators because the large penalties they 
exact for large errors compensate for the unrealistic assumption of unbounded d: 
They make such values so expensive that the estimator will try hard to avoid them. 
The most widely used loss function is squared error

(7.21) $L(\theta, d) = [d - g(\theta)]^2$

or slightly more generally weighted squared error

(7.22) $L(\theta, d) = w(\theta)[d - g(\theta)]^2.$

Since these are strictly convex in $d$, the simplifications represented by Theorem
7.8, Corollary 7.9, and Theorem 7.10 are valid in these cases. The most slowly
growing even convex loss function is absolute error

(7.23) $L(\theta, d) = |d - g(\theta)|.$

The faster the loss function increases, the more attention it pays to extreme
values of the estimators and hence to outlying observations, so that the performance
of the resulting estimators is strongly influenced by the tail behavior of
the assumed distribution of the observable random variables. As a consequence,
fast-growing loss functions lead to estimators that tend to be sensitive to the assumptions
made about this tail behavior, and these assumptions typically are based
on little information and thus are not very reliable.

It turns out that the estimators produced by squared error loss often are uncomfortably
sensitive in this respect. On the other hand, absolute error appears to go
too far in leading to estimators which discard all but the central observations. For
many important problems, the most appealing results are obtained from the use of
loss functions which lie between (7.21) and (7.23). One interesting class of such
loss functions, due to Huber (1964), puts


(7.24) $L(\theta, d) = \begin{cases} [d - g(\theta)]^2 & \text{if } |d - g(\theta)| \le k \\ 2k|d - g(\theta)| - k^2 & \text{if } |d - g(\theta)| > k. \end{cases}$

This agrees with (7.21) for $|d - g(\theta)| \le k$, but above $k$ and below $-k$, it replaces
the parabola with straight lines joined to the parabola so as to make the function
continuous and continuously differentiable (Problem 7.21).
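
The smooth join in (7.24) is easy to confirm numerically. The following Python sketch writes the Huber loss as a function of the error $d - g(\theta)$ and compares values and slopes just inside and just outside the join points $\pm k$; the function names and the choice $k = 1.5$ are assumptions made for this illustration.

```python
def huber_loss(error, k):
    """Huber loss (7.24) as a function of the error d - g(theta)."""
    if abs(error) <= k:
        return error ** 2
    return 2 * k * abs(error) - k ** 2

def huber_derivative(error, k):
    """Derivative of the Huber loss with respect to the error."""
    if abs(error) <= k:
        return 2 * error
    return 2 * k if error > 0 else -2 * k

k = 1.5
eps = 1e-8
# Value and slope agree on both sides of the join points +k and -k.
for sign in (1, -1):
    inside, outside = sign * (k - eps), sign * (k + eps)
    assert abs(huber_loss(inside, k) - huber_loss(outside, k)) < 1e-6
    assert abs(huber_derivative(inside, k) - huber_derivative(outside, k)) < 1e-6

print(huber_loss(0.5, k), huber_loss(3.0, k))   # quadratic part, then linear part
```

Beyond $\pm k$ the slope stays fixed at $\pm 2k$, which is why large errors are penalized only linearly.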

The Huber loss functions are convex but not strictly convex. An alternative 
family, which also interpolates between (7.21) and (7.23) and which is strictly 
convex, is 

(7.25) $L(\theta, d) = |d - g(\theta)|^p, \quad 1 < p < 2.$

It is a disadvantage of both (7.24) and (7.25) that the resulting estimators, even 
in fairly simple problems, cannot be obtained in closed form and hence are more 
difficult to grasp intuitively and to interpret. This may account at least in part for 
the fact that squared error is the most commonly used loss function or measure of 
accuracy and that the classic estimators in most situations are the ones derived on 
this basis. As indicated at the end of Section 1, we shall develop here the theory 
under the more general assumption of convex loss functions (which, in practice, 
does not appear to be a serious limitation), but we shall work most examples for 
the conventional squared error loss. The issue of the robustness of the resulting 
estimators, which requires going outside the assumed model, will not be treated 
in detail here. References for further study of robustness include Huber (1981), 
Hampel et al. (1986), and Staudte and Sheather (1990). 

With some care, the properties of convex and concave functions generalize to
multivariate situations. For example, Theorem 7.4 generalizes to the following
supporting hyperplane theorem for convex functions.

Theorem 7.21 Let $\phi$ be a convex function defined over an open convex set $S$ in
$E_k$ and let $\mathbf{t}$ be any point in $S$. Then, there exists a hyperplane

(7.26) $y = L(\mathbf{x}) = \sum c_i (x_i - t_i) + \phi(\mathbf{t})$

through the point $[\mathbf{t}, \phi(\mathbf{t})]$ such that

(7.27) $L(\mathbf{x}) \le \phi(\mathbf{x})$ for all $\mathbf{x} \in S$.

Jensen's inequality (Theorem 7.5) generalizes in the obvious way. The only
changes that are needed are replacement of the interval $I$ by an open convex set $S$,
of the random variable $X$ by a random vector $\mathbf{X}$ satisfying $P(\mathbf{X} \in S) = 1$, and of
the expectation $E(X)$ by the expectation vector $E(\mathbf{X}) = [E(X_1), \ldots, E(X_k)]$. For
the resulting modification of the inequality (7.7) to be meaningful, it is necessary
to know that $E(\mathbf{X})$ is in $S$ so that $\phi[E(\mathbf{X})]$ is defined.

Lemma 7.22 If $\mathbf{X}$ is a random vector with $P(\mathbf{X} \in S) = 1$, where $S$ is an open
convex set in $E_k$, and if $E(\mathbf{X})$ exists, then $E(\mathbf{X}) \in S$.

A formal proof is given by Ferguson (1967, p. 74). Here, we shall give only a
sketch. Suppose that $k = 2$, and suppose that $\xi = E(\mathbf{X})$ is not in $S$. Then, Theorem
7.21 guarantees the existence of a line $a_1 x_1 + a_2 x_2 = b$ through the point $(\xi_1, \xi_2)$
such that $S$ lies entirely on one side of the line. By a rotation of the plane, it can be
assumed without loss of generality that the equation of the line is $x_2 = \xi_2$ and that
$S$ lies above this line so that $P(X_2 > \xi_2) = 1$. It follows that $E(X_2) > \xi_2$, which
is a contradiction.

The notions of convexity and concavity can also be extended to the multidimensional
case in a slightly different way, one that examines the behavior of the
function when it is averaged over spheres instead of over pairs of points.

Definition 7.23 A continuous function $f : \mathbb{R}^k \to \mathbb{R}$ is superharmonic at a point
$\mathbf{x}_0 \in \mathbb{R}^k$ if, for every $r > 0$, the average of $f$ over the surface of the sphere
$S_r(\mathbf{x}_0) = \{\mathbf{x} : \|\mathbf{x} - \mathbf{x}_0\| = r\}$ is less than or equal to $f(\mathbf{x}_0)$. The function $f$ is
superharmonic in $\mathbb{R}^k$ if it is superharmonic at each $\mathbf{x}_0 \in \mathbb{R}^k$. (See Problem 7.15
for an extension.)

If we denote the average of $f$ over the surface of the sphere by $A_{\mathbf{x}_0}(f)$, we thus
define $f$ to be superharmonic, harmonic, or subharmonic, depending on whether
$A_{\mathbf{x}_0}(f)$ is less than or equal to, equal to, or greater than or equal to $f(\mathbf{x}_0)$, respectively.
These definitions are analogous to those of convexity and concavity, but here we
take the average over the surface of a sphere. (Note that in one dimension, the sphere
reduces to two points, so superharmonic and concave are the same property.) The
following characterization of superharmonicity, which is akin to that of Theorem
7.13, is typically easier to check than the definition. (For a proof, see Helms 1969.)

Theorem 7.24 If $f : \mathbb{R}^k \to \mathbb{R}$ is twice differentiable, then $f$ is superharmonic in
$\mathbb{R}^k$ if and only if for all $\mathbf{x} \in \mathbb{R}^k$,

(7.28) $\sum_{i=1}^k \frac{\partial^2}{\partial x_i^2} f(\mathbf{x}) \le 0.$

If Equation (7.28) is an equality, then $f$ is harmonic, and if the inequality is
reversed, then $f$ is subharmonic.

Example 7.25 Subharmonic functions. Some multivariate analogs of the convex
functions in Example 7.3 are subharmonic. For example, if $f(x_1, \ldots, x_k) = \sum_{i=1}^k x_i^p$, then

$\sum_{i=1}^k \frac{\partial^2}{\partial x_i^2} f(\mathbf{x}) = \sum_{i=1}^k p(p - 1) x_i^{p-2}.$

This function is subharmonic if $p \ge 1$ and $x_i > 0$, or if $p \ge 2$ is an even integer.
Problem 7.14 considers some other multivariate functions. ||
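
The criterion (7.28) can be explored numerically by approximating the sum of second partial derivatives with central differences. The sketch below is a rough illustration under assumptions made here: the step size, the evaluation point, and the helper names `laplacian` and `power_sum` are hypothetical, and a finite-difference value near zero would not settle the sign.

```python
def laplacian(f, x, h=1e-3):
    """Central-difference approximation to sum_i d^2 f / dx_i^2 at the point x."""
    total = 0.0
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        total += (f(up) - 2.0 * f(x) + f(down)) / (h * h)
    return total

def power_sum(p):
    """f(x) = sum_i x_i^p, as in Example 7.25."""
    return lambda x: sum(xi ** p for xi in x)

point = [0.5, 1.0, 2.0]
for p in (2, 4):
    exact = sum(p * (p - 1) * xi ** (p - 2) for xi in point)
    print(p, laplacian(power_sum(p), point), exact)   # both positive: subharmonic
```

Replacing $f$ by $-f$ reverses the sign of the sum, which gives a superharmonic function in the sense of Theorem 7.24.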

Example 7.26 Subharmonic loss. The loss function of Example 7.14, given in
Equation (7.19), has second derivative $\partial^2 L/\partial d_i^2 = a_{ii}$. Thus, it is subharmonic if,
and only if, $\sum_i a_{ii} \ge 0$. This is a weaker condition than that needed for multidimensional
convexity. ||

The property of superharmonicity is useful in the theory of minimax point
estimation, as will be seen in Section 5.6.

8 Convergence in Probability and in Law

Thus far, our preparations have centered on “small-sample” aspects, that is, we 
have considered the sample size n as being fixed. However, it is often fruitful to 
consider a sequence of situations in which n tends to infinity. If the given sample 
size is sufficiently large, the limit behavior may provide an important complement 
to the small-sample behavior, and often discloses properties of estimators that are 
masked by complications inherent in small-sample calculations. In preparation for 
a study of such large-sample asymptotics in Chapter 6, we here present some of 
the necessary tools. 

In particular, we review the probabilistic foundations necessary to derive the 
limiting behavior of estimators. It turns out that under rather weak assumptions, 
the limit distribution of many estimators is normal and hence depends only on 
a mean and a variance. This mitigates the effect of the underlying assumptions 
because the results become less dependent on the model and the loss function. 

We consider a sample $X = (X_1, \ldots, X_n)$ as a member of a sequence corresponding
to $n = 1, 2, \ldots$ (or, more generally, $n_0, n_0 + 1, \ldots$) and obtain the limiting
behavior of estimator sequences as $n \to \infty$. Mathematically, the results are thus
limit theorems.

In applications, the limiting results (particularly the asymptotic variances) are 
used as approximations to the situation obtaining for the actual finite n. A weakness 
of this approach is that, typically, no good estimates are available for the accuracy 
of the approximation. However, we can obtain at least some idea of the accuracy 
by numerical checks for selected values of n. 

Suppose for a moment that $X_1, \ldots, X_n$ are iid according to a distribution $P_\theta$, $\theta \in \Omega$,
and that the estimand is $g(\theta)$. As $n$ increases, more and more information about
$\theta$ becomes available, and one would expect that for sufficiently large values of $n$, it
would typically be possible to estimate $g(\theta)$ very closely. If $\delta_n = \delta_n(X_1, \ldots, X_n)$
is a reasonable estimator, of course, it cannot be expected to be close to $g(\theta)$ for
every sample point $(x_1, \ldots, x_n)$ since the values of a particular sample may always
be atypical (e.g., a fair coin may fall heads in 1000 successive spins). What one
can hope for is that $\delta_n$ will be close to $g(\theta)$ with high probability.

This idea is captured in the following definitions, which do not assume the 
random variables to be iid. 

Definition 8.1 A sequence of random variables $Y_n$ defined over sample spaces
$(\mathcal{Y}_n, \mathcal{B}_n)$ tends in probability to a constant $c$ ($Y_n \xrightarrow{P} c$) if for every $a > 0$

(8.1) $P[|Y_n - c| \ge a] \to 0$ as $n \to \infty$.

A sequence of estimators $\delta_n$ of $g(\theta)$ is consistent if for every $\theta \in \Omega$

(8.2) $\delta_n \xrightarrow{P} g(\theta).$

The following condition, which assumes the existence of second moments, frequently
provides a convenient method for proving consistency.

Theorem 8.2 Let $\{\delta_n\}$ be a sequence of estimators of $g(\theta)$ with mean squared
error $E[\delta_n - g(\theta)]^2$.

(i) If

(8.3) $E[\delta_n - g(\theta)]^2 \to 0$ for all $\theta$,

then $\delta_n$ is consistent for estimating $g(\theta)$.

(ii) Equivalent to (8.3), $\delta_n$ is consistent if

(8.4) $b_n(\theta) \to 0$ and $\mathrm{var}_\theta(\delta_n) \to 0$ for all $\theta$,

where $b_n$ is the bias of $\delta_n$.

(iii) In particular, $\delta_n$ is consistent if it is unbiased for each $n$ and if

(8.5) $\mathrm{var}_\theta(\delta_n) \to 0$ for all $\theta$.

The proof follows from Chebychev’s Inequality (see Problem 8.1). 
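
The route from (8.4) to consistency can be traced for a simple biased estimator. The sketch below assumes, for illustration only, iid observations with mean $\xi$ and variance $\sigma^2$ and the estimator $\delta_n = (X_1 + \cdots + X_n)/(n + 1)$ of $\xi$; the function names and parameter values are hypothetical. It evaluates the bias and variance exactly and then applies the Chebyshev bound $P(|\delta_n - \xi| \ge a) \le E(\delta_n - \xi)^2 / a^2$.

```python
def shrunken_mean_mse(n, xi, sigma2):
    """Bias, variance, and MSE of delta_n = (X_1 + ... + X_n) / (n + 1) for estimating xi."""
    bias = n * xi / (n + 1) - xi            # equals -xi / (n + 1)
    variance = n * sigma2 / (n + 1) ** 2
    return bias, variance, bias ** 2 + variance

def chebyshev_bound(mse, a):
    """Upper bound on P(|delta_n - g(theta)| >= a) from Chebyshev's inequality."""
    return min(1.0, mse / a ** 2)

xi, sigma2, a = 2.0, 9.0, 0.5               # hypothetical parameter values
for n in (10, 100, 1000, 10000):
    bias, variance, mse = shrunken_mean_mse(n, xi, sigma2)
    print(n, bias, variance, chebyshev_bound(mse, a))
```

Both the bias and the variance tend to zero, so the bound on the probability in (8.1) tends to zero as well, although the bound is uninformative for small $n$.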

Example 8.3 Consistency of the mean. Let $X_1, \ldots, X_n$ be iid with expectation
$E(X_i) = \xi$ and variance $\sigma^2 < \infty$. Then, $\bar{X}$ is an unbiased estimator of $\xi$ with variance
$\sigma^2/n$, and hence is consistent by Theorem 8.2(iii). Actually, it was proved by
Khinchin (see, for example, Feller 1968, Chapter X, Sections 1, 2) that consistency
of $\bar{X}$ already follows from the existence of the expectation, so that the assumption
of finite variance is not needed. ||

Note. The statement that $\bar{X}$ is consistent is shorthand for the fuller assertion
that the sequence of estimators $\bar{X}_n = (X_1 + \cdots + X_n)/n$ is consistent. This type of
shorthand is very common and will be used here. However, the full meaning
should be kept in mind.

Example 8.4 Consistency of $S^2$. Let $X_1, \ldots, X_n$ be iid with finite variance $\sigma^2$.
Then, the unbiased estimator

$S^2 = \sum (X_i - \bar{X})^2/(n - 1)$

is a consistent estimator of $\sigma^2$. To see this, assume without loss of generality that
$E(X_i) = 0$, and note that

$S^2 = \frac{n}{n - 1} \left[ \frac{1}{n} \sum X_i^2 - \bar{X}^2 \right].$

By Example 8.3, $\sum X_i^2/n \xrightarrow{P} \sigma^2$ and $\bar{X}^2 \xrightarrow{P} 0$. Since $n/(n - 1) \to 1$, it follows
from Problem 8.4 that $S^2 \xrightarrow{P} \sigma^2$. (See also Problem 8.5.) ||
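
The argument rests on writing $S^2$ as $n/(n - 1)$ times the difference between the mean of the squares and the square of the mean. That algebraic identity holds for any data and can be checked directly; in the sketch below, the data values and the function name `s2_from_moments` are assumptions made for illustration, and `statistics.variance` is the standard library's sample variance with divisor $n - 1$.

```python
import statistics

def s2_from_moments(xs):
    """S^2 written as n / (n - 1) * [ (1/n) sum x_i^2 - xbar^2 ]."""
    n = len(xs)
    xbar = sum(xs) / n
    mean_square = sum(x * x for x in xs) / n
    return n / (n - 1) * (mean_square - xbar ** 2)

xs = [2.0, -1.0, 0.5, 3.5, -2.0, 1.0]      # arbitrary illustrative data
print(s2_from_moments(xs), statistics.variance(xs))
```

The two printed values agree up to rounding error.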

Example 8.5 Markov chains. As an illustration of a situation involving dependent
random variables, consider a two-state Markov chain. The variables $X_1, X_2, \ldots$
each take on the values 0 and 1, with the joint distribution determined by the initial
probability $P(X_1 = 1) = p_1$ and the transition probabilities

$P(X_{i+1} = 1 \mid X_i = 0) = \pi_0, \quad P(X_{i+1} = 1 \mid X_i = 1) = \pi_1,$

of which we shall assume $0 < \pi_0, \pi_1 < 1$. For such a chain, the probability

$p_k = P(X_k = 1)$

typically depends on $k$ and the initial probability $p_1$ (but see Problem 8.10). However,
as $k \to \infty$, $p_k$ tends to a limit $p$, which is independent of $p_1$. It is easy to
see what the value of $p$ must be. Consider the recurrence relation

(8.6) $p_{k+1} = p_k \pi_1 + (1 - p_k)\pi_0 = p_k(\pi_1 - \pi_0) + \pi_0.$

If

(8.7) $p_k \to p,$

this implies

(8.8) $p = \frac{\pi_0}{1 - \pi_1 + \pi_0}.$

To prove (8.7), it is only necessary to iterate (8.6) starting with $k = 1$ to find
(Problem 8.6)

(8.9) $p_k = (p_1 - p)(\pi_1 - \pi_0)^{k-1} + p.$

Since $|\pi_1 - \pi_0| < 1$, the result follows.
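
The recurrence (8.6), the closed form (8.9), and the limit (8.8) can be compared numerically. The following Python sketch does so for one choice of $p_1$, $\pi_0$, and $\pi_1$; the parameter values and function names are assumptions made here for illustration.

```python
def marginal_probabilities(p1, pi0, pi1, n):
    """Iterate (8.6): p_{k+1} = p_k (pi1 - pi0) + pi0, returning [p_1, ..., p_n]."""
    ps = [p1]
    for _ in range(n - 1):
        ps.append(ps[-1] * (pi1 - pi0) + pi0)
    return ps

def closed_form(p1, pi0, pi1, k):
    """Formula (8.9): p_k = (p_1 - p)(pi1 - pi0)^(k-1) + p, with p from (8.8)."""
    p = pi0 / (1 - pi1 + pi0)
    return (p1 - p) * (pi1 - pi0) ** (k - 1) + p

p1, pi0, pi1 = 0.9, 0.2, 0.7               # hypothetical chain parameters
ps = marginal_probabilities(p1, pi0, pi1, 30)
limit = pi0 / (1 - pi1 + pi0)
print(ps[:4], ps[-1], limit)
```

Because the distance from the limit shrinks by the factor $|\pi_1 - \pi_0|$ at every step, the convergence in (8.7) is geometric.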

For estimating $p$, after $n$ trials, the natural estimator is $\bar{X}_n$, the frequency of ones
in these trials. Since

$E(\bar{X}_n) = (p_1 + \cdots + p_n)/n,$

it follows from (8.7) that $E(\bar{X}_n) \to p$ (Problem 8.7), so that the bias of $\bar{X}_n$ tends
to zero. Consistency of $\bar{X}_n$ will therefore follow if we can show that $\mathrm{var}(\bar{X}_n) \to 0$.
Now,

$\mathrm{var}(\bar{X}_n) = \sum_{i=1}^n \sum_{j=1}^n \mathrm{cov}(X_i, X_j)/n^2.$

As $n \to \infty$, this average of $n^2$ terms will go to zero if $\mathrm{cov}(X_i, X_j) \to 0$ sufficiently
fast as $|j - i| \to \infty$. The covariance of $X_i$ and $X_j$ can be obtained by a calculation
similar to that leading to (8.9) and satisfies

(8.10) $|\mathrm{cov}(X_i, X_j)| \le M |\pi_1 - \pi_0|^{|j - i|}.$

From (8.10), one finds that $\mathrm{var}(\bar{X}_n)$ is of order $1/n$ and hence that $\bar{X}_n$ is consistent
(Problem 8.11).

Instead of $p$, one may be interested in estimating $\pi_0$ and $\pi_1$ themselves. Again,
it turns out that the natural estimator $N_{01}/(N_{00} + N_{01})$ for $\pi_0$, where $N_{0j}$ is the
number of pairs $(X_i, X_{i+1})$ with $X_i = 0$, $X_{i+1} = j$, $j = 0, 1$, is consistent.

Consider, on the other hand, the estimation of $p_1$. It does not appear that observations
beyond the first provide any information about $p_1$, and one would therefore
not expect to be able to estimate $p_1$ consistently. To obtain a formal proof, suppose
for a moment that the $\pi$'s are known, so that $p_1$ is the only unknown parameter. If
a consistent estimator $\delta_n$ exists for the original problem, then $\delta_n$ will continue to
be consistent under this additional assumption. However, when the $\pi$'s are known,
$X_1$ is a sufficient statistic for $p_1$ and the problem reduces to that of estimating a
success probability from a single trial. That a consistent estimator of $p_1$ cannot
exist under these circumstances follows from the definition of consistency. ||


When $X_1, \ldots, X_n$ are iid according to a distribution $P_\theta$, $\theta \in \Omega$, consistent
estimators of real-valued functions of $\theta$ will exist in most of the situations we
shall encounter (see, for example, Problem 8.8). There is, however, an important
exception. Suppose the $X$'s are distributed according to $F(x_i - \theta)$ where $F$ is
$N(\xi, \sigma^2)$, with $\theta$, $\xi$, and $\sigma^2$ unknown. Then, no consistent estimator of $\theta$ exists. To
see this, note that the $X$'s are distributed as $N(\xi + \theta, \sigma^2)$. Thus, $\bar{X}$ is consistent for
estimating $\xi + \theta$, but $\xi$ and $\theta$ cannot be estimated separately because they are not
uniquely defined; they are unidentifiable (see Definition 5.2). More precisely, for
$X \sim P_{\theta, \xi}$, there exist pairs $(\theta_1, \xi_1)$ and $(\theta_2, \xi_2)$ with $\theta_1 \ne \theta_2$ for which $P_{\theta_1, \xi_1} = P_{\theta_2, \xi_2}$,
showing the parameter $\theta$ to be unidentifiable. A parameter that is unidentifiable
cannot be estimated consistently since $\delta(X_1, \ldots, X_n)$ cannot simultaneously be
close to both $\theta_1$ and $\theta_2$.

Consistency is too weak a property to be of much interest in itself. It tells us
that for large $n$, the error $\delta_n - g(\theta)$ is likely to be small but not whether the order
of the error is $1/n$, $1/\sqrt{n}$, $1/\log n$, and so on. To obtain an idea of the rate of
convergence of a consistent estimator $\delta_n$, consider the probability

(8.11) $P_n(a) = P\left[ |\delta_n - g(\theta)| \le \frac{a}{k_n} \right].$

If $k_n$ is bounded, then $P_n(a) \to 1$. On the other hand, if $k_n \to \infty$ sufficiently fast,
$P_n(a) \to 0$. This suggests that for a given $a > 0$, there might exist an intermediate
sequence $k_n \to \infty$ for which $P_n(a)$ tends to a limit strictly between 0 and 1.
This will be the case for most of the estimators with which we are concerned.
Commonly, there will exist a sequence $k_n \to \infty$ and a limit function $H$ which is
a continuous cdf such that for all $a$

(8.12) $P\{k_n[\delta_n - g(\theta)] \le a\} \to H(a)$ as $n \to \infty$.

We shall then say that the error $|\delta_n - g(\theta)|$ tends to zero at rate $1/k_n$. The rate, of
course, is not uniquely determined by this definition. If $1/k_n$ is a possible rate, so
is $1/k'_n$ for any sequence $k'_n$ for which $k'_n/k_n$ tends to a finite nonzero limit. On the
other hand, if $k'_n$ tends to $\infty$ more slowly (or faster) than $k_n$, that is, if $k'_n/k_n \to 0$
(or $\infty$), then $k'_n[\delta_n - g(\theta)]$ tends in probability to zero (or $\infty$) (Problem 8.12).

One can think of the normalizing constants $k_n$ in (8.12) in another way. If $\delta_n$ is
consistent, the errors $\delta_n - g(\theta)$ tend to zero as $n \to \infty$. Multiplication by constants
$k_n$ tending to infinity magnifies these minute errors; it acts as a microscope. If
(8.12) holds, then $k_n$ is just the right degree of magnification to give a well-focused
picture of the behavior of the errors.

We formalize (8.12) in the following definition. 

Definition 8.6 Suppose that $\{Y_n\}$ is a sequence of random variables with cdf

$H_n(a) = P(Y_n \le a)$

and that there exists a cdf $H$ such that

(8.13) $H_n(a) \to H(a)$

at all points $a$ at which $H$ is continuous. Then, we shall say that the distribution
functions $H_n$ converge weakly to $H$, and that the $Y_n$ have the limit distribution
$H$, or converge in law to any random variable $Y$ with distribution $H$. This will be
denoted by $Y_n \xrightarrow{\mathcal{L}} Y$ or by $\mathcal{L}(Y_n) \to H$. We may also say that $Y_n$ tends in law to
$H$ and write $Y_n \to H$.

The crucial assumption in (8.13) is that $H(-\infty) = 0$ and $H(+\infty) = 1$, that is,
that no probability mass escapes to $\pm\infty$ (see Problem 1.37).

The following example illustrates the reason for requiring (8.13) to hold only 
for the continuity points of H. 

Example 8.7 Degenerate limit distribution.

(i) Let $Y_n$ be normally distributed with mean zero and variance $\sigma_n^2$ where $\sigma_n \to 0$
as $n \to \infty$.

(ii) Let $Y_n$ be a random variable taking on the value $1/n$ with probability 1.

In both cases, it seems natural to say that $Y_n$ tends in law to a random variable
$Y$ which takes on the value 0 with probability 1. The cdf $H(a)$ of $Y$ is zero for
$a < 0$ and 1 for $a \ge 0$. The cdf $H_n(a)$ of $Y_n$ in both (i) and (ii) tends to $H(a)$ for
all $a \ne 0$, but not for $a = 0$ (Problem 8.14). ||
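
The behavior of $H_n(a)$ in the two cases can be tabulated directly. The sketch below takes $\sigma_n = 1/n$ in case (i), which is just one possible choice satisfying $\sigma_n \to 0$; this choice and the function names are assumptions made for illustration.

```python
import math

def normal_cdf(a, sigma):
    """cdf at a of the normal distribution with mean zero and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(a / (sigma * math.sqrt(2.0))))

def point_mass_cdf(a, location):
    """cdf at a of a random variable equal to `location` with probability 1."""
    return 1.0 if a >= location else 0.0

for n in (1, 10, 100, 1000):
    sigma_n = 1.0 / n                       # one possible choice with sigma_n -> 0
    row_i = [normal_cdf(a, sigma_n) for a in (-0.1, 0.0, 0.1)]
    row_ii = [point_mass_cdf(a, 1.0 / n) for a in (-0.1, 0.0, 0.1)]
    print(n, row_i, row_ii)
# At a = 0 the values stay at 1/2 in case (i) and at 0 in case (ii), whereas H(0) = 1.
```

At $a = \pm 0.1$ both sequences of cdf values approach $H(a)$; only the discontinuity point $a = 0$ fails, which is why (8.13) is required only at continuity points of $H$.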

An important property of weak convergence is given by the following theorem. 
Its proof, and those of Theorems 8.9-8.12, can be found in most texts on probability 
theory. See, for example, Billingsley (1995, Section 25). 

Theorem 8.8 The sequence $Y_n$ converges in law to $Y$ if and only if
$E[f(Y_n)] \to E[f(Y)]$ for every bounded continuous real-valued function $f$.

A basic tool for obtaining the limit distribution of many estimators of interest 
is the central limit theorem (CLT), of which the following is the simplest case. 

Theorem 8.9 (Central Limit Theorem) Let $X_i$ ($i = 1, \ldots, n$) be iid with $E(X_i) = \xi$
and $\mathrm{var}(X_i) = \sigma^2 < \infty$. Then, $\sqrt{n}(\bar{X} - \xi)$ tends in law to $N(0, \sigma^2)$ and hence
$\sqrt{n}(\bar{X} - \xi)/\sigma$ to the standard normal distribution $N(0, 1)$.
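
For Bernoulli variables the distribution of the standardized mean is available exactly, so the quality of the normal approximation can be examined without simulation. The sketch below is an illustration under assumptions made here ($p = 0.3$, $a = 1$, and the function names); it computes the binomial probabilities on the log scale with `math.lgamma` to avoid overflow for large $n$.

```python
import math

def binomial_pmf(n, s, p):
    """P(S = s) for S ~ Binomial(n, p) with 0 < p < 1, computed on the log scale."""
    log_prob = (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
                + s * math.log(p) + (n - s) * math.log(1 - p))
    return math.exp(log_prob)

def standardized_mean_cdf(n, p, a):
    """Exact P( sqrt(n) (Xbar - p) / sigma <= a ) for n iid Bernoulli(p) variables."""
    sigma = math.sqrt(p * (1 - p))
    cutoff = n * p + a * sigma * math.sqrt(n)    # the event is {X_1 + ... + X_n <= cutoff}
    top = min(n, math.floor(cutoff))
    return sum(binomial_pmf(n, s, p) for s in range(top + 1))

def standard_normal_cdf(a):
    """cdf of N(0, 1), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

p, a = 0.3, 1.0
for n in (10, 100, 1000):
    print(n, standardized_mean_cdf(n, p, a), standard_normal_cdf(a))
```

Because the exact distribution is discrete, its cdf moves in jumps, and the agreement with the normal cdf improves only gradually as $n$ grows.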

The usefulness of this result is greatly extended by Theorems 8.10 and 8.12 
below. 

Theorem 8.10 If $Y_n \xrightarrow{\mathcal{L}} Y$, and $A_n$ and $B_n$ tend in probability to $a$ and $b$, respectively,
then $A_n + B_n Y_n \xrightarrow{\mathcal{L}} a + bY$.

When $Y_n$ converges to a distribution $H$, it is often required to evaluate probabilities
of the form $P(Y_n \le y_n)$ where $y_n \to y$, and one may hope that these
probabilities will tend to $H(y)$.

Corollary 8.11 If $Y_n \xrightarrow{\mathcal{L}} H$, and $y_n$ converges to a continuity point $y$ of $H$, then
$P(Y_n \le y_n) \to H(y)$.

Proof. $P(Y_n \le y_n) = P[Y_n + (y - y_n) \le y]$ and the result follows from Theorem
8.10 with $B_n = 1$ and $A_n = y - y_n$. □

The following widely used result is often referred to as the delta method. 

Theorem 8.12 (Delta Method) If

(8.14) $\sqrt{n}[T_n - \theta] \xrightarrow{\mathcal{L}} N(0, \tau^2),$

then

(8.15) $\sqrt{n}[h(T_n) - h(\theta)] \xrightarrow{\mathcal{L}} N(0, \tau^2[h'(\theta)]^2),$

provided $h'(\theta)$ exists and is not zero.

Proof. Consider the Taylor expansion of $h(T_n)$ around $\theta$:

(8.16) $h(T_n) = h(\theta) + (T_n - \theta)[h'(\theta) + R_n],$

where $R_n \to 0$ as $T_n \to \theta$. It follows from (8.14) that $T_n \to \theta$ in probability and
hence that $R_n \to 0$ in probability. The result now follows by applying Theorem
8.10 to $\sqrt{n}[h(T_n) - h(\theta)]$. □

Example 8.13 Limit of binomial. Let $X_i$, $i = 1, 2, \ldots$, be independent Bernoulli($p$)
random variables and let $T_n = \frac{1}{n} \sum_{i=1}^n X_i$. Then by the CLT (Theorem 8.9)

(8.17) $\sqrt{n}(T_n - p) \xrightarrow{\mathcal{L}} N[0, p(1 - p)]$

since $E(T_n) = p$ and $\mathrm{var}(T_n) = p(1 - p)/n$.

Suppose now that we are interested in the large sample behavior of the estimate
$T_n(1 - T_n)$ of the variance $h(p) = p(1 - p)$. Since $h'(p) = 1 - 2p$, it follows from
Theorem 8.12 that

(8.18) $\sqrt{n}[T_n(1 - T_n) - p(1 - p)] \xrightarrow{\mathcal{L}} N[0, (1 - 2p)^2 p(1 - p)]$

for $p \ne 1/2$. ||
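
The asymptotic variance in (8.18) can be compared with the exact finite-sample variance, since $nT_n$ has a binomial distribution. The following sketch sums over the binomial probabilities; the value $p = 0.3$, the sample sizes, and the function names are assumptions made for illustration.

```python
import math

def binomial_pmf(n, s, p):
    """P(S = s) for S ~ Binomial(n, p) with 0 < p < 1, computed on the log scale."""
    log_prob = (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
                + s * math.log(p) + (n - s) * math.log(1 - p))
    return math.exp(log_prob)

def scaled_variance_of_plugin(n, p):
    """Exact variance of sqrt(n) [T_n (1 - T_n) - p (1 - p)], with n T_n ~ Binomial(n, p)."""
    first = 0.0
    second = 0.0
    for s in range(n + 1):
        prob = binomial_pmf(n, s, p)
        t = s / n
        y = math.sqrt(n) * (t * (1 - t) - p * (1 - p))
        first += prob * y
        second += prob * y * y
    return second - first ** 2

p = 0.3
limit = (1 - 2 * p) ** 2 * p * (1 - p)     # asymptotic variance in (8.18)
for n in (20, 200, 2000):
    print(n, scaled_variance_of_plugin(n, p), limit)
```

Calling the same function with $p = 1/2$ gives values that shrink toward zero, which reflects the degeneracy of (8.18) at that point.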

When the dominant term in the Taylor expansion (8.16) vanishes [as it does at
$p = 1/2$ in (8.18)], it is natural to carry the expansion one step further to obtain

$h(T_n) = h(\theta) + (T_n - \theta)h'(\theta) + \frac{1}{2}(T_n - \theta)^2[h''(\theta) + R_n],$

where $R_n \to 0$ in probability as $T_n \to \theta$, or, since $h'(\theta) = 0$,

(8.19) $h(T_n) - h(\theta) = \frac{1}{2}(T_n - \theta)^2[h''(\theta) + R_n].$

In view of (8.14), the distribution of $[\sqrt{n}(T_n - \theta)]^2$ tends to a nondegenerate limit
distribution, namely (after division by $\tau^2$) to a $\chi^2$-distribution with 1 degree of
freedom, and hence

(8.20) $n(T_n - \theta)^2 \xrightarrow{\mathcal{L}} \tau^2 \cdot \chi^2_1.$

The same argument as that leading to (8.15), but with $h'(\theta) = 0$ and $h''(\theta) \ne 0$,
establishes the following theorem.

Theorem 8.14 If $\sqrt{n}[T_n - \theta] \xrightarrow{\mathcal{L}} N(0, \tau^2)$ and if $h'(\theta) = 0$, then

(8.21) $n[h(T_n) - h(\theta)] \xrightarrow{\mathcal{L}} \frac{1}{2}\tau^2 h''(\theta)\chi^2_1$

provided $h''(\theta)$ exists and is not zero.

Example 8.15 Continuation of Example 8.13. For $h(p) = p(1 - p)$, we have, at
$p = 1/2$, $h'(1/2) = 0$ and $h''(1/2) = -2$. Hence, from Theorem 8.14, at $p = 1/2$,

(8.22) $n\left[T_n(1 - T_n) - \frac{1}{4}\right] \xrightarrow{\mathcal{L}} -\frac{1}{4}\chi^2_1.$

Although (8.22) might at first appear strange, note that $T_n(1 - T_n) \le 1/4$, so the
left side is never positive. An equivalent form for (8.22) is

$4n\left[\frac{1}{4} - T_n(1 - T_n)\right] \xrightarrow{\mathcal{L}} \chi^2_1.$ ||
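
The chi-squared limit at $p = 1/2$ can also be checked against the exact binomial distribution. The sketch below evaluates the cdf of $4n[1/4 - T_n(1 - T_n)]$ at 3.841, which is approximately the 0.95 quantile of $\chi^2_1$; the function names and the sample sizes are assumptions made for illustration.

```python
import math

def binomial_pmf(n, s, p):
    """P(S = s) for S ~ Binomial(n, p) with 0 < p < 1, computed on the log scale."""
    log_prob = (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
                + s * math.log(p) + (n - s) * math.log(1 - p))
    return math.exp(log_prob)

def chi_square_1_cdf(x):
    """cdf of the chi-squared distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def statistic_cdf(n, x):
    """Exact P( 4n [1/4 - T_n (1 - T_n)] <= x ) when n T_n ~ Binomial(n, 1/2)."""
    total = 0.0
    for s in range(n + 1):
        t = s / n
        if 4 * n * (0.25 - t * (1 - t)) <= x:
            total += binomial_pmf(n, s, 0.5)
    return total

for n in (10, 100, 1000):
    print(n, statistic_cdf(n, 3.841), chi_square_1_cdf(3.841))
```

Since $1/4 - T_n(1 - T_n) = (T_n - 1/2)^2$, the statistic equals the square of the standardized mean, which explains the $\chi^2_1$ limit.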
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The typical behavior of estimator sequences as sample sizes tend to infinity 
is that suggested by Theorem 8.12, that is, if S„ is the estimator of g{9) based 
on n observations, one may expect that y/n[S„ — g(0)] will tend to a normal 
distribution with mean zero and variance, say r 2 (9). It is in this sense that the 
large-sample behavior of such estimators can be studied without reference to a 
specific loss function. The asymptotic behavior of <5„ is governed solely by r 2 (0) 
since knowledge of r 2 (0) determines the probability of the error s /n[8 n — g(9)] 
lying in any given interval. In particular, r 2 (0) provides a basis for the large-sample 
comparison of different estimators. 

Contrast this to the finite-sample situation where, for example, if estimators are 
compared in terms of their risk, one estimator might be best in terms of absolute 
error, another for squared error, and still another in terms of a higher power of the 
error or the probability of falling within a stated distance of the true value. This 
cannot happen here, as $\tau^2(\theta)$ provides the basis for all large-sample evaluations.

It is straightforward to generalize the preceding theorems to functions of several 
means. The expansion (8.16) is replaced by the corresponding Taylor’s theorem in 
several variables. Although the following theorem starts in a multivariate setting, 
the conclusion is univariate. 

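Theorem 8.16 Let $(X_{1\nu}, \ldots, X_{s\nu})$, $\nu = 1, \ldots, n$, be n independent s-tuples of
random variables with $E(X_{i\nu}) = \xi_i$ and $\mathrm{cov}(X_{i\nu}, X_{j\nu}) = \sigma_{ij}$. Let $\bar{X}_i = \sum_\nu X_{i\nu}/n$,
and suppose that h is a real-valued function of s arguments with continuous first
partial derivatives. Then,

$\sqrt{n}[h(\bar{X}_1, \ldots, \bar{X}_s) - h(\xi_1, \ldots, \xi_s)] \xrightarrow{\mathcal{L}} N(0, v^2), \quad v^2 = \sum_i \sum_j \sigma_{ij} \frac{\partial h}{\partial \xi_i} \frac{\partial h}{\partial \xi_j},$

provided $v^2 > 0$.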

Proof. See Problem 8.20. □ 

Example 8.17 Asymptotic distribution of $S^2$. As an illustration of Theorem
8.16, consider the asymptotic distribution of $S^2 = \sum(Z_\nu - \bar{Z})^2/n$ where the Z's
are iid. Without loss of generality, suppose that $E(Z_\nu) = 0$, $E(Z_\nu^2) = \sigma^2$. Since
$S^2 = (1/n)\sum Z_\nu^2 - \bar{Z}^2$, Theorem 8.16 applies with $X_{1\nu} = Z_\nu^2$, $X_{2\nu} = Z_\nu$,
$h(x_1, x_2) = x_1 - x_2^2$, $\xi_2 = 0$, and $\xi_1 = \mathrm{var}(Z_\nu) = \sigma^2$. Thus, $\sqrt{n}(S^2 - \sigma^2) \xrightarrow{\mathcal{L}} N(0, v^2)$ where
$v^2 = \mathrm{var}(Z_\nu^2)$. ‖
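
As a rough numerical illustration of Example 8.17, the sketch below simulates $\sqrt{n}(S^2 - \sigma^2)$ and compares its empirical variance with $v^2 = \mathrm{var}(Z_\nu^2)$. It assumes, purely for illustration, that the Z's are standard normal, for which $\mathrm{var}(Z^2) = 2$; the sample size, number of replications, and seed are arbitrary choices and not part of the example.

```python
import math
import random
import statistics

def scaled_errors(n, reps, sigma=1.0, seed=0):
    """Simulate sqrt(n) * (S^2 - sigma^2) for normal samples of size n."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        z = [rng.gauss(0.0, sigma) for _ in range(n)]
        zbar = sum(z) / n
        s2 = sum((value - zbar) ** 2 for value in z) / n  # divisor n, as in the example
        errors.append(math.sqrt(n) * (s2 - sigma ** 2))
    return errors

errors = scaled_errors(n=200, reps=2000)
empirical_v2 = statistics.pvariance(errors)
theoretical_v2 = 2.0  # var(Z^2) = 2 * sigma^4 for normal Z with sigma = 1

if __name__ == "__main__":
    print("empirical variance:", empirical_v2)
    print("var(Z^2):", theoretical_v2)
```

For a non-normal distribution of the Z's, `theoretical_v2` would have to be replaced by the appropriate value of $\mathrm{var}(Z_\nu^2)$.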

We conclude this section by considering the multivariate case and extending 
some of the basic probability results for random variables to vectors of random 
variables. The definitions of convergence in probability and in law generalize very 
naturally as follows. 

Definition 8.18 A sequence of random vectors $Y_n = (Y_{1n}, \ldots, Y_{rn})$, $n = 1, 2, \ldots$,
tends in probability toward a constant vector $c = (c_1, \ldots, c_r)$ if $Y_{in} \xrightarrow{P} c_i$ for each
$i = 1, \ldots, r$, and it converges in law (or weakly) to a random vector Y with cdf H
if

(8.23) $H_n(a) \to H(a)$

at all continuity points a of H, where

(8.24) $H_n(a) = P[Y_{1n} \le a_1, \ldots, Y_{rn} \le a_r]$

is the cdf of $Y_n$.

Theorem 8.8 extends to the present case. 

Theorem 8.19 The sequence $\{Y_n\}$ converges in law to Y if and only if $E[f(Y_n)] \to E[f(Y)]$
for every bounded continuous real-valued f.

[For a proof of this and Theorem 8.20, see Billingsley (1995, Section 29).] 

Weak convergence of $Y_n$ to Y does not imply

(8.25) $P(Y_n \in A) \to P(Y \in A)$

for all sets A for which these probabilities are defined since this is not even true
for the set A defined by

$T_1 \le a_1, \ldots, T_r \le a_r$

unless H is continuous at a.

Theorem 8.20 The sequence $\{Y_n\}$ converges in law to Y if and only if (8.25) holds
for all sets A for which the probabilities in question are defined and for which the 
boundary of A has probability zero under the distribution of Y. 

As in the one-dimensional case, the central limit theorem provides a basic tool 
for multivariate asymptotic theory. 

Theorem 8.21 (Multivariate CLT) Let $X_\nu = (X_{1\nu}, \ldots, X_{r\nu})$ be iid with mean
vector $\xi = (\xi_1, \ldots, \xi_r)$ and covariance matrix $\Sigma = \|\sigma_{ij}\|$, and let
$\bar{X}_{in} = (X_{i1} + \cdots + X_{in})/n$. Then,

$[\sqrt{n}(\bar{X}_{1n} - \xi_1), \ldots, \sqrt{n}(\bar{X}_{rn} - \xi_r)]$

tends in law to the multivariate normal distribution with mean vector 0 and
covariance matrix $\Sigma$.

As a last result, we mention a generalization of Theorem 8.16. 

Theorem 8.22 Suppose that

$[\sqrt{n}(Y_{1n} - \theta_1), \ldots, \sqrt{n}(Y_{rn} - \theta_r)]$

tends in law to the multivariate normal distribution with mean vector 0 and
covariance matrix $\Sigma$, and suppose that $h_1, \ldots, h_r$ are r real-valued functions of
$\theta = (\theta_1, \ldots, \theta_r)$, defined and continuously differentiable in a neighborhood $\omega$ of
the parameter point $\theta$ and such that the matrix $B = \|\partial h_i/\partial \theta_j\|$ of partial
derivatives is nonsingular in $\omega$. Then,

$[\sqrt{n}[h_1(Y_n) - h_1(\theta)], \ldots, \sqrt{n}[h_r(Y_n) - h_r(\theta)]]$

tends in law to the multivariate normal distribution with mean vector 0 and with
covariance matrix $B\Sigma B'$.


Proof. See Problem 8.27. □
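
The covariance matrix $B\Sigma B'$ of Theorem 8.22 is easy to evaluate numerically. The sketch below is an illustration under assumed inputs, not part of the theorem: the functions $h_1(\theta) = \theta_1 + \theta_2$ and $h_2(\theta) = \theta_1\theta_2$, the point $\theta = (1, 2)$, and the matrix $\Sigma$ are all hypothetical choices.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*a)]

def delta_method_covariance(jacobian, sigma):
    """Limiting covariance B * Sigma * B' of Theorem 8.22."""
    return matmul(matmul(jacobian, sigma), transpose(jacobian))

# Hypothetical example: h1 = theta1 + theta2, h2 = theta1 * theta2 at theta = (1, 2).
theta1, theta2 = 1.0, 2.0
B = [[1.0, 1.0],        # partial derivatives of h1
     [theta2, theta1]]  # partial derivatives of h2
Sigma = [[1.0, 0.5],
         [0.5, 2.0]]

limit_cov = delta_method_covariance(B, Sigma)

if __name__ == "__main__":
    print(limit_cov)  # [[4.0, 5.5], [5.5, 8.0]]
```

Here B has determinant $-1$, so the nonsingularity condition holds at the chosen point. The first diagonal entry, 4, agrees with the direct computation $\sigma_{11} + \sigma_{22} + 2\sigma_{12}$ for the limiting variance of a sum.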
9 Problems
Section 1 

1.1 If $(x_1, y_1), \ldots, (x_n, y_n)$ are n points in the plane, determine the best fitting line
$y = \alpha + \beta x$ in the least squares sense, that is, determine the values $\alpha$ and $\beta$ that minimize
$\sum[y_i - (\alpha + \beta x_i)]^2$.

1.2 Let $X_1, \ldots, X_n$ be uncorrelated random variables with common expectation $\theta$ and
variance $\sigma^2$. Then, among all linear estimators $\sum \alpha_i X_i$ of $\theta$ satisfying $\sum \alpha_i = 1$, the mean
$\bar{X}$ has the smallest variance.

1.3 In the preceding problem, minimize the variance of $\sum \alpha_i X_i$ ($\sum \alpha_i = 1$)

(a) When the variance of $X_i$ is $\sigma^2/a_i$ ($a_i$ known).

(b) When the $X_i$ have common variance $\sigma^2$ but are correlated with common correlation
coefficient $\rho$.

(For generalizations of these results see, for example, Watson 1967 and Kruskal 1968.) 

1.4 Let X and Y have common expectation $\theta$, variances $\sigma^2$ and $\tau^2$, and correlation
coefficient $\rho$. Determine the conditions on $\sigma$, $\tau$, and $\rho$ under which

(a) $\mathrm{var}(X) < \mathrm{var}[(X + Y)/2]$,

(b) The value of $\alpha$ that minimizes $\mathrm{var}[\alpha X + (1 - \alpha)Y]$ is negative.

Give an intuitive explanation of your results. 

1.5 Let $X_i$ (i = 1, 2) be independently distributed according to the Cauchy densities
$C(a_i, b_i)$. Then, $X_1 + X_2$ is distributed as $C(a_1 + a_2, b_1 + b_2)$. [Hint: Transform to new
variables $Y_1 = X_1 + X_2$, $Y_2 = X_2$.]

1.6 If $X_1, \ldots, X_n$ are iid as C(a, b), the distribution of $\bar{X}$ is again C(a, b). [Hint: Prove by
induction, using Problem 1.5.]

1.7 A median of X is any value m such that $P(X \le m) \ge 1/2$ and $P(X \ge m) \ge 1/2$.

(a) Show that this is equivalent to $P(X < m) \le 1/2$ and $P(X > m) \le 1/2$.

(b) Show that the set of medians is always a closed interval $m_0 \le m \le m_1$.

1.8 If $\phi(a) = E|X - a| < \infty$ for some a, show that $\phi(a)$ is minimized by any median of
X. [Hint: If $m_0 \le m \le m_1$ (in the notation of Problem 1.7) and $m_1 < c$, then

$E|X - c| - E|X - m| = (c - m)[P(X \le m) - P(X > m)] + 2\int_{m < x < c} (c - x)\,dP(x)$.]

1.9 (a) The median of any set of distinct real numbers $x_1, \ldots, x_n$ is defined to be the
middle one of the ordered x's when n is odd, and any value between the two middle
ordered x's when n is even. Show that this is also the median of the random variable
X which takes on each of the values $x_1, \ldots, x_n$ with probability 1/n.

(b) For any set of distinct real numbers $x_1, \ldots, x_n$, the sum of absolute deviations
$\sum |x_i - a|$ is minimized by any median of the x's.

(c) For n given points $(x_i, y_i)$, i = 1, ..., n, find the value b that minimizes $\sum |y_i - bx_i|$.
[Hint: Reduce the problem to a special case of Problem 1.8.]

1.10 For any set of numbers $x_1, \ldots, x_n$ and a monotone function $h(\cdot)$, show that the value of
a that minimizes $\sum_{i=1}^n [h(x_i) - h(a)]^2$ is given by $a = h^{-1}\left(\sum_{i=1}^n h(x_i)/n\right)$. Find functions
h that will yield the arithmetic, geometric, and harmonic means as minimizers.

[Hint: Recall that the geometric mean of non-negative numbers is $(\prod x_i)^{1/n}$ and the
harmonic mean is $[(1/n)\sum(1/x_i)]^{-1}$. This problem, and some of its implications, is
considered by Casella and Berger (1992).]

1.11 (a) If two estimators $\delta_1$, $\delta_2$ have continuous symmetric densities $f_i(x - \theta)$, i = 1, 2,
and $f_1(0) > f_2(0)$, then

$P[|\delta_1 - \theta| < c] > P[|\delta_2 - \theta| < c]$ for some c > 0

and hence $\delta_1$ will be closer to $\theta$ than $\delta_2$ with respect to the measure (1.5).

(b) Let X, Y be independently distributed with common continuous symmetric density
f, and let $\delta_1 = X$, $\delta_2 = (X + Y)/2$. The inequality in part (a) will hold provided
$2\int f^2(x)\,dx < f(0)$ (Edgeworth 1883, Stigler 1980).

1.12 (a) Let $f(x) = (1/2)(k - 1)/(1 + |x|)^k$, k > 2. Show that f is a probability density
and that all its moments of order < k − 1 are finite.

(b) The density of part (a) satisfies the inequality of Problem 1.11(b).

1.13 (a) If X is binomial b(p, n), show that

$E\left|\frac{X}{n} - p\right| = 2\binom{n-1}{k-1} p^k (1 - p)^{n-k+1}$ for $\frac{k-1}{n} \le p \le \frac{k}{n}$.

(b) Graph the risk function of part (a) for n = 4 and n = 5.

[Hint: For (a), use the identity

$\binom{n}{x}(x - np) = n\left[\binom{n-1}{x-1}(1 - p) - \binom{n-1}{x}p\right]$, $1 \le x \le n$.

(Johnson 1957-1958, and Blyth 1980).]
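
The statement of Problem 1.9(b) can be explored numerically before attempting a proof. The following sketch uses made-up data and a crude grid search (both are illustrative assumptions) to locate the minimizer of $\sum |x_i - a|$ and compare it with the median.

```python
import statistics

def sum_abs_dev(xs, a):
    """Sum of absolute deviations of the x's from a."""
    return sum(abs(x - a) for x in xs)

# Odd number of distinct points: the median is the unique minimizer.
xs_odd = [3.0, 1.0, 7.0, 4.0, 10.0]
grid = [i / 100 for i in range(0, 1201)]  # candidate values 0.00, 0.01, ..., 12.00
best_on_grid = min(grid, key=lambda a: sum_abs_dev(xs_odd, a))
median_odd = statistics.median(xs_odd)

# Even number of points: every value between the two middle x's does equally well.
xs_even = [1.0, 2.0, 6.0, 9.0]
flat_values = [sum_abs_dev(xs_even, a) for a in (2.0, 4.0, 6.0)]

if __name__ == "__main__":
    print(best_on_grid, median_odd)
    print(flat_values)
```

A grid search only suggests the result; the hint to Problem 1.8 indicates how to establish it in general.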

Section 2 

2.1 If $A_1, A_2, \ldots$ are members of a $\sigma$-field $\mathcal{A}$ (the A's need not be disjoint), so are their
union and intersection.

2.2 For any a < b, the following sets are Borel sets (a) $\{x : a < x\}$ and (b) $\{x : a \le x \le b\}$.

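2.3 Under the assumptions of Problem 2.1, let

$\underline{A} = \liminf A_n = \{x : x \in A_n$ for all except a finite number of n's$\}$,

$\overline{A} = \limsup A_n = \{x : x \in A_n$ for infinitely many n$\}$.

Then, $\underline{A}$ and $\overline{A}$ are in $\mathcal{A}$.

2.4 Show that

(a) If $A_1 \subset A_2 \subset \cdots$, then $\underline{A} = \overline{A} = \cup A_n$.

(b) If $A_1 \supset A_2 \supset \cdots$, then $\underline{A} = \overline{A} = \cap A_n$.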

2.5 For any sequence of real numbers $a_1, a_2, \ldots$, show that the set of all limit points of
subsequences is closed. The smallest and largest such limit point (which may be infinite)
are denoted by $\liminf a_k$ and $\limsup a_k$, respectively.

2.6 Under the assumptions of Problems 2.1 and 2.3, show that

$I_{\underline{A}}(x) = \liminf I_{A_k}(x)$ and $I_{\overline{A}}(x) = \limsup I_{A_k}(x)$

where $I_A(x)$ denotes the indicator of the set A.

2.7 Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measure space and let $\mathcal{B}$ be the class of all sets $A \cup C$ with $A \in \mathcal{A}$
and C a subset of a set $A' \in \mathcal{A}$ with $\mu(A') = 0$. Show that $\mathcal{B}$ is a $\sigma$-field.

2.8 If f and g are measurable functions, so are (i) f + g, and (ii) max(f, g).

2.9 If f is integrable with respect to $\mu$, so is |f|, and $|\int f\,d\mu| \le \int |f|\,d\mu$. [Hint: Express
|f| in terms of $f^+$ and $f^-$.]

2.10 Let $\mathcal{X} = \{x_1, x_2, \ldots\}$, $\mu$ = counting measure on $\mathcal{X}$, and f integrable. Then $\int f\,d\mu = \sum f(x_i)$.
[Hint: Suppose, first, that $f \ge 0$ and let $s_n(x)$ be the simple function, which is
f(x) for $x = x_1, \ldots, x_n$, and 0 otherwise.]

2.11 Let f(x) = 1 or 0 as x is rational or irrational. Show that the Riemann integral of f
does not exist.


Section 3 

3.1 Let X have a standard normal distribution and let Y = 2X. Determine whether 

(a) the cdf F(x, y) of (X, Y) is continuous.

(b) the distribution of (X, Y) is absolutely continuous with respect to Lebesgue measure 
in the (x, y) plane. 

3.2 Show that any function f which satisfies (3.7) is continuous. 

3.3 Let X be a measurable transformation from $(\mathcal{E}, \mathcal{B})$ to $(\mathcal{X}, \mathcal{A})$ (i.e., such that for any
$A \in \mathcal{A}$, the set $\{e : X(e) \in A\}$ is in $\mathcal{B}$), and let Y be a measurable transformation from
$(\mathcal{X}, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{C})$. Then, Y[X(e)] is a measurable transformation from $(\mathcal{E}, \mathcal{B})$ to $(\mathcal{Y}, \mathcal{C})$.

3.4 In Example 3.1, show that the support of P is [a, b] if and only if F is strictly increasing
on [a, b].

3.5 Let S be the support of a distribution on a Euclidean space $(\mathcal{X}, \mathcal{A})$. Then, (i) S is closed;
(ii) P(S) = 1; (iii) S is the intersection of all closed sets C with P(C) = 1.

3.6 If P and Q are two probability measures over the same Euclidean space which are 
equivalent, then they have the same support. 

3.7 Let P and Q assign probabilities

P: $P(X = 1/n) = p_n > 0$, n = 1, 2, ... ($\sum p_n = 1$),

Q: $P(X = 0) = 1/2$; $P(X = 1/n) = q_n > 0$; n = 1, 2, ... ($\sum q_n = 1/2$).

Then, show that P and Q have the same support but are not equivalent.

3.8 Suppose X and Y are independent random variables with $X \sim E(\lambda, 1)$ and
$Y \sim E(\mu, 1)$. It is impossible to obtain direct observations of X and Y. Instead, we observe
the random variables Z and W, where

$Z = \min\{X, Y\}$ and $W = 1$ if $Z = X$, $W = 0$ if $Z = Y$.

Find the joint distribution of Z and W and show that they are independent. (The X and
Y variables are censored, a situation that often arises in medical experiments. Suppose
that X measures survival time from some treatment, and the patient leaves the survey
for some unrelated reason. We do not get a measurement on X, but only a lower bound.)


Section 4 

4.1 If the distributions of a positive random variable X form a scale family, show that the 
distributions of log X form a location family. 

4.2 If X is distributed according to the uniform distribution $U(0, \theta)$, show that the
distribution of $-\log X$ is exponential.

4.3 Let U be uniformly distributed on (0, 1) and consider the variables $X = U^\alpha$, $0 < \alpha$.
Show that this defines a group family, and determine the density of X.

4.4 Show that a transformation group is a group. 

4.5 If $g_0$ is any element of a group G, show that as g ranges over G so does $gg_0$.

4.6 Show that for p = 2, the density (4.15) specializes to (4.16). 

4.7 Show that the family of transformations (4.12) with B nonsingular and lower triangular 
form a group G. 

4.8 Show that the totality of nonsingular multivariate normal distributions can be obtained 
by the subgroup G of (4.12) described in Problem 4.7. 

4.9 In the preceding problem, show that G can be replaced by the subgroup $G_0$ of lower
triangular matrices $B = (b_{ij})$, in which the diagonal elements $b_{11}, \ldots, b_{pp}$ are all positive,
but that no proper subgroup of $G_0$ will suffice.

4.10 Show that the family of all continuous distributions whose support is an interval with 
positive lower end point is a group family. [Hint: Let U be uniformly distributed on the 
interval (2, 3) and let X = b[g(U)] a where a, b > 0 and where g is continuous and 1:1 
from (2, 3) to (2, 3).] 

4.11 Find a modification of the transformation group (4.22) which generates a random
sample from a population $\{y_1, \ldots, y_N\}$ where the y's, instead of being arbitrary, are
restricted to (a) be positive and (b) satisfy $0 < y_i < 1$.

4.12 Generalize the transformation group of Example 4.10 to the case of s populations
$\{y_{ij}, j = 1, \ldots, N_i\}$, i = 1, ..., s, with a random sample of size $n_i$ being drawn from
the ith population.

4.13 Let U be a positive random variable, and let

$X = bU^{1/c}$, b > 0, c > 0.

(a) Show that this defines a group family.

(b) If U is distributed as E(0, 1), then X is distributed according to the Weibull
distribution with density

$\frac{c}{b}\left(\frac{x}{b}\right)^{c-1} e^{-(x/b)^c}$, x > 0.

4.14 If F and $F_0$ are two continuous, strictly increasing cdf's on the real line, and if the
cdf of U is $F_0$ and g is strictly increasing, show that the cdf of g(U) is F if and only if
$g = F^{-1}(F_0)$.

4.15 The following two families of distributions are not group families: 

(a) The class of binomial distributions b(p, n), with n fixed and 0 < p < 1. 

(b) The class of Poisson distributions $P(\lambda)$, $0 < \lambda$.

[Hint: (a) How many 1:1 transformations are there taking the set of integers {0, 1, ..., n}
into itself?]

4.16 Let $X_1, \ldots, X_r$ have a multivariate normal distribution with $E(X_i) = \xi_i$ and with
covariance matrix $\Sigma$. If X is the column matrix with elements $X_i$ and B is an r × r
matrix of constants, then BX has a multivariate normal distribution with mean $B\xi$ and
covariance matrix $B\Sigma B'$.
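
The construction of Problem 4.13(b) also gives a simple way to generate Weibull variables. The sketch below is an illustration only: it draws U from E(0, 1), sets $X = bU^{1/c}$, and compares the empirical cdf with $1 - e^{-(x/b)^c}$, the cdf corresponding to the Weibull density stated in the problem. The values of b and c, the seed, and the sample size are arbitrary assumptions.

```python
import math
import random

def weibull_draw(b, c, rng):
    """One draw of X = b * U**(1/c) with U distributed as E(0, 1)."""
    u = rng.expovariate(1.0)  # standard exponential
    return b * u ** (1.0 / c)

def weibull_cdf(x, b, c):
    """Cdf belonging to the density (c/b)(x/b)^(c-1) exp(-(x/b)^c), x > 0."""
    return 1.0 - math.exp(-((x / b) ** c))

def empirical_cdf(sample, x):
    """Proportion of sample values not exceeding x."""
    return sum(value <= x for value in sample) / len(sample)

rng = random.Random(1)
b, c = 2.0, 1.5
sample = [weibull_draw(b, c, rng) for _ in range(20000)]

if __name__ == "__main__":
    for x in (1.0, 2.0, 3.0):
        print(x, empirical_cdf(sample, x), weibull_cdf(x, b, c))
```

Agreement of the two columns is only a plausibility check; the problem asks for the derivation of the density.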


Section 5 

5.1 Determine the natural parameter space of (5.2) when s = 1, $T_1(x) = x$, $\mu$ is Lebesgue
measure, and h(x) is (i) $e^{-|x|}$ and (ii) $e^{-|x|}/(1 + x^2)$.

5.2 Suppose in (5.2), s = 2 and $T_2(x) = T_1(x)$. Explain why it is impossible to estimate $\eta_1$.
[Hint: Compare the model with that obtained by putting $\eta_1' = \eta_1 + c$, $\eta_2' = \eta_2 - c$.]

5.3 Show that the distribution of a sample from the p-variate normal density (4.15)
constitutes an s-dimensional exponential family. Determine s and identify the functions $\eta_i$,
$T_i$, and B of (5.1).

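5.4 Efron (1975) gives very general definitions of curvature, which generalize (10.1) and
(10.2). For the s-dimensional family (5.1) with covariance matrix $\Sigma_\theta$, if $\theta$ is a scalar,
define the statistical curvature to be $\gamma_\theta = (|M_\theta|/m_{11}^3)^{1/2}$ where

$M_\theta = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} = \begin{pmatrix} \dot\eta_\theta' \Sigma_\theta \dot\eta_\theta & \dot\eta_\theta' \Sigma_\theta \ddot\eta_\theta \\ \ddot\eta_\theta' \Sigma_\theta \dot\eta_\theta & \ddot\eta_\theta' \Sigma_\theta \ddot\eta_\theta \end{pmatrix}$,

with $\eta(\theta) = \{\eta_i(\theta)\}$, $\dot\eta(\theta) = \{\eta_i'(\theta)\}$ and $\ddot\eta(\theta) = \{\eta_i''(\theta)\}$. Calculate the curvature of the
family (see Example 6.19) $C \exp\left[-\sum_{i=1}^n (x_i - \theta)^m\right]$ for m = 2, 3, 4. Are the values of
$\gamma_\theta$ ordered in the way you expected them to be?

5.5 Let $(X_1, X_2)$ have a bivariate normal distribution with mean vector $\xi = (\xi_1, \xi_2)$ and
identity covariance matrix. In each of the following situations, verify the curvature,
$\gamma_\theta$, of the family.


(a) $\xi = (\theta, \theta)$, $\gamma_\theta = 0$.

(b) $\xi = (\theta_1, \theta_2)$, $\theta_1^2 + \theta_2^2 = r^2$, $\gamma_\theta = 1/r$.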

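5.6 In the density (5.1)

(a) For s = 1 show that $E_\theta[T(X)] = B'(\theta)/\eta'(\theta)$ and $\mathrm{var}_\theta[T(X)] = \frac{B''(\theta)}{[\eta'(\theta)]^2} - \frac{\eta''(\theta)B'(\theta)}{[\eta'(\theta)]^3}$.

(b) For s > 1, show that $E_\theta[T(X)] = J^{-1}\nabla B$ where J is the Jacobian matrix defined
by $J = \{\partial \eta_j/\partial \theta_i\}$ and $\nabla B$ is the gradient vector $\nabla B = \{\frac{\partial}{\partial \theta_i} B(\theta)\}$.

(See Johnson, Ladalla, and Liu (1979) for a general treatment of these identities.)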

5.7 Verify the relations (a) (5.22) and (b) (5.26). 

5.8 For the binomial distribution (5.28), verify (a) the moment generating function (5.30) 
and (b) the moments (5.31). 

5.9 For the Poisson distribution (5.32), verify the moments (5.35). 

5.10 In a Bernoulli sequence of trials with success probability p, let X + m be the number
of trials required to achieve m successes.


(a) Show that the distribution of X, the negative binomial distribution, is as given in
Table 5.1.

(b) Verify that the negative binomial probabilities add up to 1 by expanding
$\left(\frac{1}{p} - \frac{q}{p}\right)^{-m} = p^m(1 - q)^{-m}$.

(c) Show that the distributions of (a) constitute a one-parameter exponential family.

(d) Show that the moment generating function of X is $M_X(u) = p^m/(1 - qe^u)^m$.

(e) Show that $E(X) = mq/p$ and $\mathrm{var}(X) = mq/p^2$.

(f) By expanding $K_X(u)$, show that the first four cumulants of X are $\kappa_1 = mq/p$,
$\kappa_2 = mq/p^2$, $\kappa_3 = mq(1 + q)/p^3$, and $\kappa_4 = mq(1 + 4q + q^2)/p^4$.


5.11 In the preceding problem, let $X_i + 1$ be the number of trials required after the (i − 1)st
success has been obtained until the next success occurs. Use the fact that $X = \sum_{i=1}^m X_i$
to find an alternative derivation of the mean and variance in part (e).

5.12 A discrete random variable with probabilities

$P(X = x) = a(x)\theta^x/C(\theta)$, x = 0, 1, ...; $a(x) \ge 0$; $\theta > 0$,

is a power series distribution. This is an exponential family (5.1) with s = 1, $\eta = \log\theta$,
and T = X. The moment generating function is $M_X(u) = C(\theta e^u)/C(\theta)$.

5.13 Show that the binomial, negative binomial, and Poisson distributions are special cases
of the power series distribution of Problem 5.12, and determine $\theta$ and $C(\theta)$.

5.14 The distribution of Problem 5.12 with $a(x) = 1/x$ and $C(\theta) = -\log(1 - \theta)$,
x = 1, 2, ...; $0 < \theta < 1$, is the logarithmic series distribution. Show that the moment
generating function is $\log(1 - \theta e^u)/\log(1 - \theta)$ and determine E(X) and var(X).

5.15 For the multinomial distribution (5.4), verify the moment formulas (5.16). 

5.16 As an alternative to using (5.14) and (5.15), obtain the moments (5.16) by representing
each $X_i$ as a sum of n indicators, as was done in (5.5).

5.17 For the gamma distribution (5.41),

(a) verify the formulas (5.42), (5.43), and (5.44);

(b) show that (5.43), with the middle term deleted, holds not only for all positive
integers r but for all real $r > -\alpha$.

5.18 (a) Prove Lemma 5.15. (Use integration by parts.)

(b) By choosing g(x) to be $x^2$ and $x^3$, use the Stein Identity to calculate the third and
fourth moments of the $N(\mu, \sigma^2)$ distribution.

5.19 Using Lemma 5.15:

(a) Derive the form of the identity for $X \sim \mathrm{Gamma}(\alpha, b)$ and use it to verify the
moments given in (5.44).

(b) Derive the form of the identity for $X \sim \mathrm{Beta}(a, b)$, and use it to verify that
$E(X) = a/(a + b)$ and $\mathrm{var}(X) = ab/[(a + b)^2(a + b + 1)]$.

5.20 As an alternative to the approach of Problem 5.19(b) for calculating the moments of
$X \sim B(a, b)$, a general formula for $EX^k$ (similar to equation (5.43)) can be derived.
Do so, and use it to verify the mean and variance of X given in Problem 5.19. [Hint:
Write $EX^k$ as the integral of $x^{c-1}(1 - x)^{d-1}$ and use the constant B(c, d) of Table 5.1.
Note that a similar approach will work for many other distributions, including the $\chi^2$,
Student's t, and F distributions.]

5.21 The Stein Identity can also be applied to discrete exponential families, as shown by
Hudson (1978) and generalized by Hwang (1982a). If X takes values in N = {0, 1, ...}
with probability function

$p_\theta(x) = \exp[\theta x - B(\theta)]h(x)$,

then for any $g : N \to \mathbb{R}$ with $E_\theta|g(X)| < \infty$, we have the identity

$E g(X) = e^{-\theta} E\{t(X)g(X - 1)\}$

where t(0) = 0 and $t(x) = h(x - 1)/h(x)$ for x > 0.

(a) Prove the identity.

(b) Use the identity to calculate the first four moments of the binomial distribution
(5.31).

(c) Use the identity to calculate the first four moments of the Poisson distribution
(5.35).

5.22 The inverse Gaussian distribution, $IG(\lambda, \mu)$, has density function

x > 0, $\lambda, \mu > 0$.

(a) Show that this density constitutes an exponential family.

(b) Show that this density is a scale family (as defined in Example 4.1).

(c) Show that the statistics $\bar{X} = (1/n)\sum x_i$ and $S^* = \sum(1/x_i - 1/\bar{x})$ are complete
sufficient statistics.

(d) Show that $\bar{X} \sim IG(n\lambda, n\mu)$ and $S^* \sim (1/\lambda)\chi^2_{n-1}$.

Note: Together with the normal and gamma distributions, the inverse Gaussian completes 
the trio of families that are both an exponential and a group family of distributions. This 
fact plays an important role in distribution theory based on saddlepoint approximations 
(Daniels 1983) or likelihood theory (Barndorff-Nielsen 1983). 

5.23 In Example 5.14, show that

(a) $\chi^2_1$ is the distribution of $Y^2$ where Y is distributed as N(0, 1);

(b) $\chi^2_n$ is the distribution of $Y_1^2 + \cdots + Y_n^2$ where the $Y_i$ are independent N(0, 1).

5.24 Determine the values $\alpha$ for which the density (5.41) is (a) a decreasing function of x
on $(0, \infty)$ and (b) increasing for $x < x_0$ and decreasing for $x > x_0$ ($0 < x_0$). In case (b),
determine the mode of the density.

5.25 A random variable X has the Pareto distribution P(c, k) if its cdf is $1 - (k/x)^c$,
x > k > 0, c > 0.

(a) The distributions P(c, 1) constitute a one-parameter exponential family (5.2) with
$\eta = -c$ and $T = \log X$.

(b) The statistic T is distributed as $E(\log k, 1/c)$.

(c) The family P(c, k) (0 < k, 0 < c) is a group family.

5.26 If (X, Y) is distributed according to the bivariate normal distribution (4.16) with
$\xi = \eta = 0$:

(a) Show that the moment generating function of (X, Y) is

$M_{X,Y}(u_1, u_2) = e^{[u_1^2\sigma^2 + 2\rho\sigma\tau u_1 u_2 + u_2^2\tau^2]/2}$.

(b) Use (a) to show that

$\mu_{12} = \mu_{21} = 0$, $\mu_{11} = \rho\sigma\tau$,

$\mu_{13} = 3\rho\sigma\tau^3$, $\mu_{31} = 3\rho\sigma^3\tau$, $\mu_{22} = (1 + 2\rho^2)\sigma^2\tau^2$.

5.27 (a) If X is a random column vector with expectation $\xi$, then the covariance matrix
of X is $\mathrm{cov}(X) = E[(X - \xi)(X' - \xi')]$.

(b) If the density of X is (4.15), then $\xi = a$ and $\mathrm{cov}(X) = \Sigma$.

5.28 (a) Let X be distributed with density $p_\theta(x)$ given by (5.1), and let A be any fixed
subset of the sample space. Then, the distributions of X truncated on A, that is,
the distributions with density $p_\theta(x)I_A(x)/P_\theta(A)$ again constitute an exponential
family.

(b) Give an example in which the natural parameter space of the original exponential 
family is a proper subset of the natural parameter space of the truncated family. 





5.29 If $X_i$ are independently distributed according to $\Gamma(\alpha_i, b)$, show that $\sum X_i$ is distributed
as $\Gamma(\sum \alpha_i, b)$. [Hint: Method 1. Prove it first for the sum of two gamma variables by a
transformation to new variables $Y_1 = X_1 + X_2$, $Y_2 = X_1/X_2$ and then use induction.
Method 2. Obtain the moment generating function of $\sum X_i$ and use the fact that a
distribution is uniquely determined by its moment generating function, when the latter exists
for at least some $u \neq 0$.]

5.30 When the $X_i$ are independently distributed according to Poisson distributions $P(\lambda_i)$,
find the distribution of $\sum X_i$.

5.31 Let $X_1, \ldots, X_n$ be independently distributed as $\Gamma(\alpha, b)$. Show that the joint
distribution is a two-parameter exponential family and identify the functions $\eta_i$, $T_i$, and B of
(5.1).

5.32 If Y is distributed as $\Gamma(\alpha, b)$, determine the distribution of c log Y and show that for
fixed $\alpha$ and varying b it defines an exponential family.

5.33 Morris (1982, 1983b) investigated the properties of natural exponential families with
quadratic variance functions. There are only six such families: normal, binomial, gamma,
Poisson, negative binomial, and the lesser-known generalized hyperbolic secant
distribution, which is the density of $X = \frac{1}{\pi}\log\left(\frac{Y}{1-Y}\right)$ when $Y \sim \mathrm{Beta}\left(\frac{1}{2} + \frac{\theta}{\pi}, \frac{1}{2} - \frac{\theta}{\pi}\right)$, $|\theta| < \frac{\pi}{2}$.

(a) Find the density of X, and show that it constitutes an exponential family.

(b) Find the mean and variance of X, and show that the variance equals $1 + \mu^2$, where
$\mu$ is the mean.

Subsequent work on quadratic and other power variance families has been done by
Bar-Lev and Enis (1986, 1988), Bar-Lev and Bshouty (1989), and Letac and Mora (1990).
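
The moment formulas of Problem 5.10(e) can be checked numerically before they are derived. The sketch below assumes the negative binomial probabilities have the form $P(X = x) = \binom{m+x-1}{x} p^m q^x$, x = 0, 1, ..., with q = 1 − p (Table 5.1 is not reproduced here, so this form is an assumption), and sums the series up to an arbitrary cutoff. The values m = 4 and p = 0.3 are illustrative.

```python
import math

def negative_binomial_pmf(x, m, p):
    """Assumed form of the negative binomial probability of x failures
    before the m-th success."""
    q = 1.0 - p
    return math.comb(m + x - 1, x) * p ** m * q ** x

def truncated_moments(m, p, x_max=1000):
    """Total probability, mean, and variance from the series cut off at x_max."""
    probs = [negative_binomial_pmf(x, m, p) for x in range(x_max + 1)]
    total = sum(probs)
    mean = sum(x * prob for x, prob in enumerate(probs))
    second = sum(x * x * prob for x, prob in enumerate(probs))
    return total, mean, second - mean ** 2

m, p = 4, 0.3
q = 1.0 - p
total, mean, variance = truncated_moments(m, p)

if __name__ == "__main__":
    print("total probability:", total)
    print("mean:", mean, "mq/p:", m * q / p)
    print("variance:", variance, "mq/p^2:", m * q / p ** 2)
```

The cutoff `x_max` only needs to be large enough that the neglected tail is negligible; for p close to 0 it would have to be increased.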


Section 6 

6.1 Extend Example 6.2 to the case that $X_1, \ldots, X_r$ are independently distributed with
Poisson distributions $P(\lambda_i)$ where $\lambda_i = a_i\lambda$ ($a_i > 0$, known).

6.2 Let $X_1, \ldots, X_n$ be iid according to a distribution F and probability density f. Show that
the conditional distribution given $X_{(i)} = a$ of the i − 1 values to the left of a and the n − i
values to the right of a is that of i − 1 variables distributed independently according to the
probability density f(x)/F(a) and n − i variables distributed independently with density
f(x)/[1 − F(a)], respectively, with the two sets being (conditionally) independent of
each other.

6.3 Let f be a positive integrable function over $(0, \infty)$, and let $p_\theta(x)$ be the density over
$(0, \theta)$ defined by $p_\theta(x) = c(\theta)f(x)$ if $0 < x < \theta$, and 0 otherwise. If $X_1, \ldots, X_n$ are iid
with density $p_\theta$, show that $X_{(n)}$ is sufficient for $\theta$.

6.4 Let f be a positive integrable function defined over $(-\infty, \infty)$ and let $p_{\xi,\eta}(x)$ be the
probability density defined by $p_{\xi,\eta}(x) = c(\xi, \eta)f(x)$ if $\xi < x < \eta$, and 0 otherwise. If
$X_1, \ldots, X_n$ are iid with density $p_{\xi,\eta}$, show that $(X_{(1)}, X_{(n)})$ is sufficient for $(\xi, \eta)$.

6.5 Show that each of the statistics $T_1$ to $T_4$ of Example 6.11 is sufficient.

6.6 Prove Corollary 6.13. 

6.7 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independently distributed according to $N(\xi, \sigma^2)$
and $N(\eta, \tau^2)$, respectively. Find the minimal sufficient statistics for these cases:

(a) $\xi$, $\eta$, $\sigma$, $\tau$ are arbitrary: $-\infty < \xi, \eta < \infty$, $0 < \sigma, \tau$.

(b) $\sigma = \tau$ and $\xi$, $\eta$, $\sigma$ are arbitrary.

(c) $\xi = \eta$ and $\xi$, $\sigma$, $\tau$ are arbitrary.

6.8 Let $X_1, \ldots, X_n$ be iid according to $N(\sigma, \sigma^2)$, $0 < \sigma$. Find a minimal set of sufficient
statistics.

6.9 (a) If $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$ have the same elementary symmetric functions
$\sum x_i = \sum y_i$, $\sum_{i \neq j} x_i x_j = \sum_{i \neq j} y_i y_j$, ..., $x_1 \cdots x_n = y_1 \cdots y_n$, then the y's are a
permutation of the x's.

(b) In the notation of Example 6.10, show that U is equivalent to V. [Hint: Compare
the coefficients and the roots of the polynomials $P(x) = \prod(x - u_i)$ and
$Q(x) = \prod(x - v_i)$.]

6.10 Show that the order statistics are minimal sufficient for the location family (6.7) when
f is the density of

(a) the double exponential distribution D(0, 1).

(b) the Cauchy distribution C(0, 1).

6.11 Prove the following generalization of Theorem 6.12 to families without common 
support. 

Theorem 9.1 Let $\mathcal{P}$ be a finite family with densities $p_i$, i = 0, ..., k, and for any x,
let S(x) be the set of pairs of subscripts (i, j) for which $p_i(x) + p_j(x) > 0$. Then, the
statistic

$T(X) = \left\{ \frac{p_j(X)}{p_i(X)},\ i < j \text{ and } (i, j) \in S(X) \right\}$

is minimal sufficient. Here, $p_j(x)/p_i(x) = \infty$ if $p_i(x) = 0$ and $p_j(x) > 0$.

6.12 In Problem 6.11 it is not enough to replace $p_i(X)$ by $p_0(X)$. To see this let k = 2 and
$p_0 = U(-1, 0)$, $p_1 = U(0, 1)$, and $p_2(x) = 2x$, 0 < x < 1.

6.13 Let k = 1 and $P_i = U(i, i + 1)$, i = 0, 1.

(a) Show that a minimal sufficient statistic for $\mathcal{P} = \{P_0, P_1\}$ is T(X) = i if i < X < i + 1,
i = 0, 1.

(b) Let $X_1$ and $X_2$ be iid according to a distribution from $\mathcal{P}$. Show that each of the two
statistics $T_1 = T(X_1)$ and $T_2 = T(X_2)$ is sufficient for $(X_1, X_2)$.

(c) Show that $T(X_1)$ and $T(X_2)$ are equivalent.

6.14 In Lemma 6.14, show that the assumption of common support can be replaced by
the weaker assumption that every $\mathcal{P}_0$-null set is also a $\mathcal{P}$-null set so that (a.e. $\mathcal{P}_0$) is
equivalent to (a.e. $\mathcal{P}$).

6.15 Let $X_1, \ldots, X_n$ be iid according to a distribution from $\mathcal{P} = \{U(0, \theta), \theta > 0\}$, and
let $\mathcal{P}_0$ be the subfamily of $\mathcal{P}$ for which $\theta$ is rational. Show that every $\mathcal{P}_0$-null set in the
sample space is also a $\mathcal{P}$-null set.

6.16 Let $X_1, \ldots, X_n$ be iid according to a distribution from a family $\mathcal{P}$. Show that T is
minimal sufficient in the following cases:

(a) $\mathcal{P} = \{U(0, \theta), \theta > 0\}$; $T = X_{(n)}$.

(b) $\mathcal{P} = \{U(\theta_1, \theta_2), -\infty < \theta_1 < \theta_2 < \infty\}$; $T = (X_{(1)}, X_{(n)})$.

(c) $\mathcal{P} = \{U(\theta - 1/2, \theta + 1/2), -\infty < \theta < \infty\}$; $T = (X_{(1)}, X_{(n)})$.

6.17 Solve the preceding problem for the following cases:

(a) $\mathcal{P} = \{E(\theta, 1), -\infty < \theta < \infty\}$; $T = X_{(1)}$.

(b) $\mathcal{P} = \{E(0, b), 0 < b\}$; $T = \sum X_i$.

(c) $\mathcal{P} = \{E(a, b), -\infty < a < \infty, 0 < b\}$; $T = (X_{(1)}, \sum[X_i - X_{(1)}])$.

6.18 Show that the statistics $X_{(1)}$ and $\sum[X_i - X_{(1)}]$ of Problem 6.17(c) are independently
distributed as E(a, b/n) and b Gamma(n − 1, 1), respectively.

[Hint: If a = 0 and b = 1, the variables $Y_i = (n - i + 1)[X_{(i)} - X_{(i-1)}]$, i = 2, ..., n, are
iid as E(0, 1).]

6.19 Show that the sufficient statistics of (i) Problem 6.3 and (ii) Problem 6.4 are minimal 
sufficient. 

6.20 (a) Show that in the $N(\theta, \theta)$ curved exponential family, the sufficient statistic
$T = (\sum x_i, \sum x_i^2)$ is not minimal.

(b) For the density of Example 6.19, show that $T = (\sum x_i, \sum x_i^2, \sum x_i^3)$ is a minimal
sufficient statistic.

6.21 For the situation of Example 6.25(ii), find an unbiased estimator of $\xi$ based on $\sum X_i$,
and another based on $\sum X_i^2$; hence, deduce that $T = (\sum X_i, \sum X_i^2)$ is not complete.

6.22 For the situation of Example 6.26, show that X is minimal sufficient and complete. 

6.23 For the situation of Example 6.27: 

(a) Show that $X = (X_1, X_2)$ is minimal sufficient for the family (6.16) with restriction
(6.17). 

(b) Establish (6.18), and hence that the minimal sufficient statistic of part (a) is not 
complete. 

6.24 (Messig and Strawderman 1993) Show that for the general dose-response model 

PeW = fl ) ['feW)P [1 - rutWl*-* , 

the statistic X = (Xi, X 2 ,..., X„) is minimal sufficient if there exist vectors 
6\, 6 * 2 , • • •, 9 m ) such that the m x m matrix 

l \%W)[l - 

is invertible. (Hint: Theorem 6.12.) 

6.25 Let $(X_i, Y_i)$, i = 1, ..., n, be iid according to the uniform distribution over a set R in
the (x, y) plane and let $\mathcal{P}$ be the family of distributions obtained by letting R range over
a class $\mathcal{R}$ of sets R. Determine a minimal sufficient statistic for the following cases:

(a) $\mathcal{R}$ is the set of all rectangles $a_1 < x < a_2$, $b_1 < y < b_2$, $-\infty < a_1 < a_2 < \infty$,
$-\infty < b_1 < b_2 < \infty$.

(b) $\mathcal{R}'$ is the subset of $\mathcal{R}$ for which $a_2 - a_1 = b_2 - b_1$.

(c) $\mathcal{R}''$ is the subset of $\mathcal{R}'$ for which $a_2 - a_1 = b_2 - b_1 = 1$.

6.26 Solve the preceding problem if

(a) $\mathcal{R}$ is the set of all triangles with sides parallel to the x axis, the y axis, and the line
y = x, respectively.

(b) $\mathcal{R}'$ is the subset of $\mathcal{R}$ in which the sides parallel to the x and y axes are equal.

6.27 Formulate a general result of which Problems 6.25(a) and 6.26(a) are special cases. 

6.28 If Y is distributed as $E(\eta, 1)$, the distribution of $X = e^{-Y}$ is $U(0, e^{-\eta})$. (This result is
useful in the computer generation of random variables; see Problem 4.4.14.)

6.29 If a minimal sufficient statistic exists, a necessary condition for a sufficient statistic to
be complete is for it to be minimal. [Hint: Suppose that T = h(U) is minimal sufficient
and U is complete. To show that U is equivalent to T, note that otherwise there exists
$\psi$ such that $\psi(U) \neq \eta[h(U)]$ with positive probability where $\eta(t) = E[\psi(U)|t]$.]

6.30 Show that the minimal sufficient statistics $T = (X_{(1)}, X_{(n)})$ of Problem 6.16(b) are
complete. [Hint: Use the approach of Example 6.24.]

6.31 For each of the following problems, determine whether the minimal sufficient statistic 
is complete: (a) Problem 6.7(a)-(c); (b) Problem 6.25(a)-(c); (c) Problem 6.26(a) and 
(b). 

6.32 (a) Show that if $\mathcal{P}_0$, $\mathcal{P}_1$ are two families of distributions such that $\mathcal{P}_0 \subset \mathcal{P}_1$ and every 
null set of $\mathcal{P}_0$ is also a null set of $\mathcal{P}_1$, then a sufficient statistic $T$ that is complete 
for $\mathcal{P}_0$ is also complete for $\mathcal{P}_1$. 

(b) Let $\mathcal{P}_0$ be the class of binomial distributions $b(p, n)$, $0 < p < 1$, $n$ fixed, and let 
$\mathcal{P}_1 = \mathcal{P}_0 \cup \{Q\}$ where $Q$ is the Poisson distribution with expectation 1. Then $\mathcal{P}_0$ is 
complete but $\mathcal{P}_1$ is not. 

6.33 Let $X_1, \ldots, X_n$ be iid each with density $f(x)$ (with respect to Lebesgue measure), 
which is unknown. Show that the order statistics are complete. 

[Hint: Use Problem 6.32(a) with $\mathcal{P}_0$ the class of distributions of Example 6.15(iv). 
Alternatively, let $\mathcal{P}_0$ be the exponential family with density 

C{0 U .... J 

6.34 Suppose that $X_1, \ldots, X_n$ are an iid sample from a location-scale family with 
distribution function $F((x - a)/b)$. 

(a) If $b$ is known, show that the differences $(X_1 - X_i)/b$, $i = 2, \ldots, n$, are ancillary. 

(b) If $a$ is known, show that the ratios $(X_1 - a)/(X_i - a)$, $i = 2, \ldots, n$, are ancillary. 

(c) If neither $a$ nor $b$ is known, show that the quantities $(X_1 - X_i)/(X_2 - X_i)$, $i = 
3, \ldots, n$, are ancillary. 

6.35 Use Basu’s theorem to prove independence of the following pairs of statistics: 

(a) $\bar{X}$ and $\sum (X_i - \bar{X})^2$ where the $X$’s are iid as $N(\xi, \sigma^2)$. 

(b) $X_{(1)}$ and $\sum [X_i - X_{(1)}]$ in Problem 6.18. 

6.36 (a) Under the assumptions of Problem 6.18, the ratios $Z_i = [X_{(n)} - X_{(i)}]/[X_{(n)} - X_{(n-1)}]$, 
$i = 1, \ldots, n - 2$, are independent of $\{X_{(1)}, \sum [X_i - X_{(1)}]\}$. 

(b) Under the assumptions of Problems 6.16(b) and 6.30, the ratios $Z_i = [X_{(i)} - X_{(1)}]/[X_{(n)} - X_{(1)}]$, 
$i = 2, \ldots, n - 1$, are independent of $\{X_{(1)}, X_{(n)}\}$. 

6.37 Under the assumptions of Theorem 6.5, let $A$ be any fixed set in the sample space, $P_\theta^*$ 
the distribution $P_\theta$ truncated on $A$, and $\mathcal{P}^* = \{P_\theta^*, \theta \in \Omega\}$. Then prove 

(a) if $T$ is sufficient for $\mathcal{P}$, it is sufficient for $\mathcal{P}^*$. 

(b) if, in addition, $T$ is complete for $\mathcal{P}$, it is also complete for $\mathcal{P}^*$. 

Generalizations of this result were derived by Tukey in the 1940s and also by Smith 
(1957). The analogous problem for observations that are censored rather than truncated 
is discussed by Bhattacharyya, Johnson, and Mehrotra (1977). 

6.38 If $X_1, \ldots, X_n$ are iid as $B(a, b)$, 

(a) Show that $[\prod X_i, \prod (1 - X_i)]$ is minimal sufficient for $(a, b)$. 

(b) Determine the minimal sufficient statistic when $a = b$. 

Section 7 

7.1 Verify the convexity of the functions (i)-(vi) of Example 7.3. 

7.2 Show that $x^p$ is concave over $(0, \infty)$ if $0 < p < 1$. 

7.3 Give an example showing that a convex function need not be continuous on a closed 
interval. 

7.4 If $\phi$ is convex on $(a, b)$ and $\psi$ is convex and nondecreasing on the range of $\phi$, show 
that the function $\psi[\phi(x)]$ is convex on $(a, b)$. 

7.5 Prove or disprove by counterexample each of the following statements. If $\phi$ is convex 
on $(a, b)$, then so is (i) $e^{\phi(x)}$ and (ii) $\log \phi(x)$ if $\phi > 0$. 

7.6 Show that if equality holds in (7.1) for some $0 < \gamma < 1$, then $\phi$ is linear on $[x, y]$. 

7.7 Establish the following lemma, which is useful in examining the risk functions of 
certain estimators. (For further discussion, see Casella 1990). 

Lemma 9.2 Let $r : [0, \infty) \to [0, \infty)$ be concave. Then (i) $r(t)$ is nondecreasing and 
(ii) $r(t)/t$ is nonincreasing. 

7.8 Prove Jensen’s inequality for the case that $X$ takes on the values $x_1, \ldots, x_n$ with 
probabilities $\gamma_1, \ldots, \gamma_n$ ($\sum \gamma_i = 1$) directly from (7.1) by induction over $n$. 

7.9 A slightly different form of the Rao-Blackwell theorem, which applies only to the 
variance of an estimator rather than any convex loss, can be established without Jensen’s 
inequality. 

(a) For any estimator $\delta(x)$ with $\mathrm{var}[\delta(X)] < \infty$, and any statistic $T$, show that 

$$\mathrm{var}[\delta(X)] = \mathrm{var}[E(\delta(X) \mid T)] + E[\mathrm{var}(\delta(X) \mid T)].$$

(b) Based on the identity in part (a), formulate and prove a Rao-Blackwell type theorem 
for variances. 

(c) The identity in part (a) plays an important role in both theoretical and applied 
statistics. For example, explain how Equation (1.2) can be interpreted as a special 
case of this identity. 
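
The identity in part (a) can be checked exactly on a small discrete example. The sketch below is an illustration only: the joint distribution `pmf` of $(T, \delta)$ and all function names are hypothetical, and exact rational arithmetic is used so that the two sides can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (T, delta): keys are (t, d), values are probabilities.
pmf = {(0, 1): F(1, 4), (0, 3): F(1, 4), (1, 2): F(1, 6), (1, 8): F(1, 3)}

def mean(dist):
    """Mean of a list of (value, probability) pairs."""
    return sum(v * p for v, p in dist)

def var(dist):
    """Variance of a list of (value, probability) pairs."""
    m = mean(dist)
    return sum((v - m) ** 2 * p for v, p in dist)

def decomposition(pmf):
    """Return (var(delta), var(E[delta|T]), E[var(delta|T)])."""
    total = var([(d, p) for (_, d), p in pmf.items()])
    cond_means, cond_vars = [], []
    for t in sorted({t for t, _ in pmf}):
        pt = sum(p for (s, _), p in pmf.items() if s == t)          # P(T = t)
        cond = [(d, p / pt) for (s, d), p in pmf.items() if s == t]  # law of delta given T = t
        cond_means.append((mean(cond), pt))
        cond_vars.append((var(cond), pt))
    return total, var(cond_means), mean(cond_vars)

if __name__ == "__main__":
    total, between, within = decomposition(pmf)
    print(total, between, within)
```

For this particular table the total variance splits into a between-group part and a within-group part; since the within-group part is nonnegative, the conditional expectation never has larger variance than $\delta$ itself, which is the content of part (b).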

7.10 Let U be uniformly distributed on (0, 1), and let F be a distribution function on the 
real line. 

(a) If $F$ is continuous and strictly increasing, show that $F^{-1}(U)$ has distribution 
function $F$. 

(b) For arbitrary $F$, show that $F^{-1}(U)$ continues to have distribution function $F$. 

[Hint: Take $F^{-1}$ to be any nondecreasing function such that $F^{-1}[F(x)] = x$ for all $x$ for 
which there exists no $x' \neq x$ with $F(x') = F(x)$.] 
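
Part (b) is the basis of inverse-transform sampling. As a hedged sketch, the code below builds one possible generalized inverse, $F^{-1}(u) = \min\{x : F(x) \ge u\}$, for a discrete distribution; the helper name `make_quantile` and the example distribution are assumptions introduced for illustration.

```python
import bisect
import random

def make_quantile(values, probs):
    """Generalized inverse F^{-1}(u) = min{x : F(x) >= u} for a discrete F.

    `values` must be sorted in increasing order and `probs` must sum to 1.
    """
    cdf, total = [], 0.0
    for p in probs:
        total += p
        cdf.append(total)

    def quantile(u):
        idx = bisect.bisect_left(cdf, u)
        # Guard against floating-point round-off in the last cumulative sum.
        return values[min(idx, len(values) - 1)]

    return quantile

if __name__ == "__main__":
    q = make_quantile([1, 2, 5], [0.2, 0.5, 0.3])
    rng = random.Random(0)
    draws = [q(rng.random()) for _ in range(20000)]
    print({v: draws.count(v) / len(draws) for v in (1, 2, 5)})
```

Feeding uniform draws through `quantile` should reproduce the target probabilities up to sampling error, even though this $F$ is neither continuous nor strictly increasing.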

7.11 Show that the k-dimensional sphere $\sum_{i=1}^{k} x_i^2 \le c$ is convex. 

7.12 Show that $f(a) = \sqrt{|x - a|} + \sqrt{|y - a|}$ is minimized by $a = x$ and $a = y$. 

7.13 (a) Show that $\phi(x) = e^{\sum x_i}$ is convex by showing that its Hessian matrix is positive 
semidefinite. 

(b) Show that the result of Problem 7.4 remains valid if $\phi$ is a convex function defined 
over an open convex set in $E_k$. 

(c) Use (b) to obtain an alternative proof of the result of part (a). 

7.14 Determine whether the following functions are super- or subharmonic: 

(a) $\sum_i x_i^p$, $p < 1$, $x_i > 0$. 

(b) $e^{-\sum_i x_i^2}$. 

(c) $\log\left(\prod_i x_i\right)$. 

7.15 A function is lower semicontinuous at the point $y$ if $f(y) \le \liminf_{x \to y} f(x)$. The 
definition of superharmonic can be extended from continuous to lower semicontinuous 
functions. 

(a) Show that a continuous function is lower semicontinuous. 

(b) The function $f(x) = I(a < x < b)$ is superharmonic on $(-\infty, \infty)$. 

(c) For an estimator $d$ of $\theta$, show that the loss function 

$$L(\theta, d) = \begin{cases} 0 & \text{if } |d - \theta| < k \\ 2 & \text{if } |d - \theta| > k \end{cases}$$

is subharmonic. 


7.16 (a) If $f : \Re^p \to \Re$ is superharmonic, then $\varphi(f(\cdot))$ is also superharmonic, where 
$\varphi : \Re \to \Re$ is a twice-differentiable increasing concave function. 

(b) If $h$ is superharmonic, then $h^*(x) = \int g(x - y) h(y) \, dy$ is also superharmonic, where 
$g(\cdot)$ is a density. 

(c) If $h_\gamma$ is superharmonic, then so is $h^*(x) = \int h_\gamma(x) \, dG(\gamma)$ where $G(\gamma)$ is a 
distribution function. 

(Assume that all necessary integrals exist, and that derivatives may be taken inside the 
integrals.) 

7.17 Use the convexity of the function $\phi$ of Problem 7.13 to show that the natural parameter 
space of the exponential family (5.2) is convex. 

7.18 Show that if $f$ is defined and bounded over $(-\infty, \infty)$ or $(0, \infty)$, then $f$ cannot be 
convex (unless it is constant). 

7.19 Show that $\phi(x, y) = -\sqrt{xy}$ is convex over $x > 0$, $y > 0$. 

7.20 If $f$ and $g$ are real-valued functions such that $f^2$, $g^2$ are measurable with respect to 
the $\sigma$-finite measure $\mu$, prove the Schwarz inequality 

$$\left( \int f g \, d\mu \right)^2 \le \int f^2 \, d\mu \int g^2 \, d\mu.$$

[Hint: Write $\int f g \, d\mu = \left( \int g^2 \, d\mu \right) E_Q(f/g)$, where $Q$ is the probability measure with 
$dQ = g^2 \, d\mu / \int g^2 \, d\mu$, and apply Jensen’s inequality with $\phi(x) = x^2$.] 

7.21 Show that the loss functions (7.24) are continuously differentiable. 

7.22 Prove the statements made in Example 7.20(i) and (ii). 

7.23 Let $f$ be a unimodal density symmetric about 0, and let $L(\theta, d) = \rho(d - \theta)$ be a loss 
function with $\rho$ nondecreasing on $(0, \infty)$ and symmetric about 0. 

(a) The function $\phi(a) = E[\rho(X - a)]$ defined in Theorem 7.15 takes on its minimum 
at 0. 

(b) If 

$$S_a = \{x : [\rho(x + a) - \rho(x - a)][f(x + a) - f(x - a)] \neq 0\},$$

then $\phi(a)$ takes on its unique minimum value at $a = 0$ if and only if there exists $a_0$ 
such that $\phi(a_0) < \infty$, and $\mu(S_a) > 0$ for all $a$. [Hint: Note that $\phi(0) \le \frac{1}{2}[\phi(2a) + 
\phi(-2a)]$, with strict inequality holding if and only if $\mu(S_a) > 0$ for all $a$.] 





7.24 (a) Suppose that $f$ and $\rho$ satisfy the assumptions of Problem 7.23 and that $f$ is 
strictly decreasing on $[0, \infty)$. Then, if $\phi(a_0) < \infty$ for some $a_0$, $\phi(a)$ has a unique 
minimum at zero unless there exists $c \le d$ such that 

$\rho(0) = c$ and $\rho(x) = d$ for all $x \neq 0$. 

(b) If $\rho$ is symmetric about 0, strictly increasing on $[0, \infty)$, and $\phi(a_0) < \infty$ for some 
$a_0$, then $\phi(a)$ has a unique minimum at 0 for all symmetric unimodal $f$. 
[Problems 7.23 and 7.24 were communicated by Dr. W.Y. Loh.] 

7.25 Let $\rho$ be a real-valued function satisfying 

$0 \le \rho(t) \le M < \infty$ and $\rho(t) \to M$ as $t \to \pm\infty$, 

and let $X$ be a random variable with a continuous probability density $f$. Then $\phi(a) = 
E[\rho(X - a)]$ attains its minimum. [Hint: Show that (a) $\phi(a) \to M$ as $a \to \pm\infty$ and 
(b) $\phi$ is continuous. Here, (b) follows from the fact (see, for example, TSH2, Appendix, 
Section 2) that if $f_n$, $n = 1, 2, \ldots$, and $f$ are probability densities such that $f_n(x) \to f(x)$ 
a.e., then $\int \psi f_n \to \int \psi f$ for any bounded $\psi$.] 

7.26 Let $\phi$ be a strictly convex function defined over an interval $I$ (finite or infinite). If 
there exists a value $a_0$ in $I$ minimizing $\phi(a)$, then $a_0$ is unique. 

7.27 Generalize Corollary 7.19 to the case where $X$ and $\mu$ are vectors. 


Section 8 

8.1 (a) Prove Chebychev’s Inequality: For any random variable $X$ and non-negative 
function $g(\cdot)$, 

$$P(g(X) \ge \varepsilon) \le \frac{1}{\varepsilon} E g(X)$$

for every $\varepsilon > 0$. (In many statistical applications, it is useful to take $g(x) = 
(x - a)^2/b^2$ for some constants $a$ and $b$.) 

(b) Prove Lemma 9.3. [Hint: Apply Chebychev’s Inequality.] 

Lemma 9.3 A sufficient condition for $Y_n$ to converge in probability to $c$ is that 
$E(Y_n - c)^2 \to 0$. 
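
As a quick numerical sanity check of the inequality in part (a), and not a proof, the sketch below evaluates both sides exactly for a finite discrete distribution. The helper `chebyshev_check` and the example distribution (uniform on $\{-3, \ldots, 3\}$ with $g(x) = x^2$) are assumptions made for illustration.

```python
from fractions import Fraction as F

def chebyshev_check(support, probs, g, eps):
    """Return (P(g(X) >= eps), E g(X) / eps) for a finite discrete X."""
    tail = sum(p for x, p in zip(support, probs) if g(x) >= eps)
    bound = sum(g(x) * p for x, p in zip(support, probs)) / eps
    return tail, bound

if __name__ == "__main__":
    support = list(range(-3, 4))
    probs = [F(1, 7)] * 7
    for eps in (1, 2, 4, 9):
        tail, bound = chebyshev_check(support, probs, lambda x: x * x, eps)
        print(eps, tail, bound, tail <= bound)
```

The bound is loose for small $\varepsilon$ (it can exceed 1) and becomes informative only in the tail.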

8.2 To see that the converse of Theorem 8.2 does not hold, let $X_1, \ldots, X_n$ be iid with 
$E(X_i) = \theta$, $\mathrm{var}(X_i) = \sigma^2 < \infty$, and let $\delta_n = \bar{X}$ with probability $1 - \varepsilon_n$ and $\delta_n = A_n$ with 
probability $\varepsilon_n$. If $\varepsilon_n$ and $A_n$ are constants satisfying 

$\varepsilon_n \to 0$ and $\varepsilon_n A_n \to \infty$, 

then $\delta_n$ is consistent for estimating $\theta$, but $E(\delta_n - \theta)^2$ does not tend to zero. 

8.3 Suppose $\rho(x)$ is an even function, nondecreasing and non-negative for $x \ge 0$ and 
positive for $x > 0$. Then, $E\{\rho[\delta_n - g(\theta)]\} \to 0$ for all $\theta$ implies that $\delta_n$ is consistent for 
estimating $g(\theta)$. 

8.4 (a) If $A_n$, $B_n$, and $Y_n$ tend in probability to $a$, $b$, and $y$, respectively, then $A_n + B_n Y_n$ 
tends in probability to $a + by$. 

(b) If $A_n$ takes on the constant value $a_n$ with probability 1 and $a_n \to a$, then $A_n \to a$ 
in probability. 

8.5 Referring to Example 8.4, show that $c_n S^2 \to \sigma^2$ in probability for any sequence of constants $c_n \to 1$. 
In particular, the MLE $\hat{\sigma}^2 = \frac{n-1}{n} S^2$ is a consistent estimator of $\sigma^2$. 

8.6 Verify Equation (8.9). 

8.7 If $\{a_n\}$ is a sequence of real numbers tending to $a$, and if $b_n = (a_1 + \cdots + a_n)/n$, then 
$b_n \to a$. 

8.8 (a) If $\delta_n$ is consistent for $\theta$, and $g$ is continuous, then $g(\delta_n)$ is consistent for $g(\theta)$. 

(b) Let $X_1, \ldots, X_n$ be iid as $N(\theta, 1)$, and let $g(\theta) = 0$ if $\theta \neq 0$ and $g(0) = 1$. Find a 
consistent estimator of $g(\theta)$. 

8.9 (a) In Example 8.5, find $\mathrm{cov}(X_i, X_j)$ for any $i \neq j$. 

(b) Verify (8.10). 

8.10 (a) In Example 8.5, find the value of $p_1$ for which $p_k$ becomes independent of $k$. 

(b) If $p_1$ has the value given in (a), then for any integers $i_1 < \cdots < i_r$ and $k$, the joint 
distribution of $X_{i_1}, \ldots, X_{i_r}$ is the same as that of $X_{i_1 + k}, \ldots, X_{i_r + k}$. 

[Hint: Do not calculate, but use the definition of the chain.] 

8.11 Suppose $X_1, \ldots, X_n$ have a common mean $\xi$ and variance $\sigma^2$, and that $\mathrm{cov}(X_i, X_j) = 
\rho_{j-i}$. For estimating $\xi$, show that: 

(a) $\bar{X}$ is not consistent if $\rho_{j-i} = \rho \neq 0$ for all $i \neq j$; 

(b) $\bar{X}$ is consistent if $|\rho_{j-i}| \le M \gamma^{j-i}$ with $|\gamma| < 1$. 

[Hint: (a) Note that $\mathrm{var}(\bar{X}) \ge 0$ for all sufficiently large $n$ requires $\rho \ge 0$, and determine 
the distribution of $\bar{X}$ in the multivariate normal case.] 

8.12 Suppose that $k_n[\delta_n - g(\theta)]$ tends in law to a continuous limit distribution $H$. Prove 
that: 

(a) If $k'_n/k_n \to d \neq 0$ or $\infty$, then $k'_n[\delta_n - g(\theta)]$ also tends to a continuous limit 
distribution. 

(b) If $k'_n/k_n \to 0$ or $\infty$, then $k'_n[\delta_n - g(\theta)]$ tends in probability to zero or infinity, 
respectively. 

(c) If $k_n \to \infty$, then $\delta_n \to g(\theta)$ in probability. 

8.13 Show that if $Y_n \to c$ in probability, then it tends in law to a random variable $Y$ which 
is equal to $c$ with probability 1. 

8.14 (a) In Example 8.7(i) and (ii), $Y_n \to 0$ in probability. Show that: 

(b) If $H_n$ denotes the distribution function of $Y_n$ in Example 8.7(i) and (ii), then 
$H_n(a) \to 0$ for all $a < 0$ and $H_n(a) \to 1$ for all $a > 0$. 

(c) Determine $\lim H_n(0)$ for Example 8.7(i) and (ii). 

8.15 If $T_n > 0$ satisfies $\sqrt{n}[T_n - \theta] \to N(0, \tau^2)$, find the limiting distribution of (a) $\sqrt{T_n}$ 
and (b) $\log T_n$ (suitably normalized). 

8.16 If $T_n$ satisfies $\sqrt{n}[T_n - \theta] \to N(0, \tau^2)$, find the limiting distribution of (a) $T_n^2$, (b) 
$\log |T_n|$, (c) $1/T_n$, and (d) $e^{T_n}$ (suitably normalized). 
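
A simulation can make the delta-method limits in Problems 8.15 and 8.16 concrete. The sketch below is an illustration under assumed settings: $T_n$ is the mean of $n$ exponential variables with mean $\theta = 1$ (so $\tau^2 = 1$), and $h = \log$, for which the delta method suggests that $\sqrt{n}[\log T_n - \log\theta]$ is approximately $N(0, \tau^2/\theta^2) = N(0, 1)$. The function names, sample sizes, and seed are hypothetical.

```python
import math
import random

def scaled_log_means(n, reps, seed=0):
    """Simulate sqrt(n) * (log T_n - log theta) for exponential data with theta = 1."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        t_n = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * math.log(t_n))  # log(theta) = 0 here
    return out

def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

if __name__ == "__main__":
    draws = scaled_log_means(n=200, reps=2000)
    print(sum(draws) / len(draws), sample_variance(draws))
```

The printed mean and variance should be close to 0 and 1, respectively, up to Monte Carlo error and a small finite-$n$ bias.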

8.17 Variance stabilizing transformations are transformations for which the resulting 
statistic has an asymptotic variance that is independent of the parameters of interest. For each 
of the following cases, find the asymptotic distribution of the transformed statistic and 
show that it is variance stabilizing. 

(a) $T_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, $X_i \sim \text{Poisson}(\lambda)$, $h(T_n) = \sqrt{T_n}$. 

(b) $T_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, $X_i \sim \text{Bernoulli}(p)$, $h(T_n) = \arcsin\sqrt{T_n}$. 
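
To see what "variance stabilizing" means in case (b), one can simulate the transformed statistic for several values of $p$. The sketch below is only an empirical illustration with assumed settings (function name, $n$, number of replications, and seed are hypothetical): it estimates the variance of $\sqrt{n} \arcsin\sqrt{T_n}$, which the delta method suggests is close to $1/4$ whatever the value of $p$, whereas $n \, \mathrm{var}(T_n) = p(1 - p)$ changes with $p$.

```python
import math
import random

def stabilized_variance(p, n=300, reps=1500, seed=0):
    """Sample variance of sqrt(n) * arcsin(sqrt(T_n)) for Bernoulli(p) data."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        # T_n is the proportion of successes in n Bernoulli(p) trials.
        t_n = sum(rng.random() < p for _ in range(n)) / n
        values.append(math.sqrt(n) * math.asin(math.sqrt(t_n)))
    m = sum(values) / reps
    return sum((v - m) ** 2 for v in values) / (reps - 1)

if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, p * (1 - p), stabilized_variance(p))
```

The middle column (the unstabilized variance) varies with $p$, while the last column should stay near $0.25$.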

8.18 (a) The function $v(\cdot)$ is a variance stabilizing transformation if the estimator $v(T_n)$ 
has asymptotic variance $\tau^2(\theta)[v'(\theta)]^2 = c$, where $c$ is a constant independent of $\theta$. 

(b) For any positive integer $n$, find the variance stabilizing transformation if $\tau^2(\theta) = \theta^n$. 
In particular, be careful of the important case $n = 2$. 

[A variance stabilizing transformation (if it exists) is the solution of a differential equation 
resulting from the Delta Method approximation of the variance of an estimator (Theorem 
8.12) and is not a function of the distribution of the statistic (other than the fact that the 
distribution will determine the form of the variance). The transformations of part (b) are 
known as the Box-Cox family of power transformations and play an important role in 
applied statistics. For more details and interesting discussions, see Bickel and Doksum 
1981, Box and Cox 1982, and Hinkley and Runger 1984.] 

8.19 Serfling (1980, Section 3.1) remarks that the following variations of Theorem 8.12 
can be established. Show that: 

(a) If $h$ is differentiable in a neighborhood of $\theta$, and $h'$ is continuous at $\theta$, then $h'(\theta)$ 
may be replaced by $h'(T_n)$ to obtain 

$$\sqrt{n} \, \frac{h(T_n) - h(\theta)}{\tau h'(T_n)} \xrightarrow{\mathcal{L}} N(0, 1).$$

(b) Furthermore, if $\tau^2$ is a continuous function of $\theta$, say $\tau^2(\theta)$, it can be replaced by 
$\tau^2(T_n)$ to obtain 

$$\sqrt{n} \, \frac{h(T_n) - h(\theta)}{\tau(T_n) h'(T_n)} \xrightarrow{\mathcal{L}} N(0, 1).$$

8.20 Prove Theorem 8.16. 

[Hint: Under the assumptions of the theorem we have the Taylor expansion 

$$h(x_1, \ldots, x_s) = h(\xi_1, \ldots, \xi_s) + \sum (x_i - \xi_i) \left[ \frac{\partial h}{\partial \xi_i} + R_i \right]$$

where $R_i \to 0$ as $x_i \to \xi_i$.] 


8.21 A sequence of numbers $R_n$ is said to be $o(1/k_n)$ as $n \to \infty$ if $k_n R_n \to 0$ and to be 
$O(1/k_n)$ if there exist $M$ and $n_0$ such that $|k_n R_n| < M$ for all $n > n_0$ or, equivalently, 
if $k_n R_n$ is bounded. 

(a) If $R_n = o(1/k_n)$, then $R_n = O(1/k_n)$. 

(b) $R_n = O(1)$ if and only if $R_n$ is bounded. 

(c) $R_n = o(1)$ if and only if $R_n \to 0$. 

(d) If $R_n$ is $O(1/k_n)$ and $k'_n/k_n$ tends to a finite limit, then $R_n$ is $O(1/k'_n)$. 

8.22 (a) If $R_n$ and $R'_n$ are both $O(1/k_n)$, so is $R_n + R'_n$. 

(b) If $R_n$ and $R'_n$ are both $o(1/k_n)$, so is $R_n + R'_n$. 

8.23 Suppose $k'_n/k_n \to \infty$. 

(a) If $R_n = O(1/k_n)$ and $R'_n = O(1/k'_n)$, then $R_n + R'_n = O(1/k_n)$. 

(b) If $R_n = o(1/k_n)$ and $R'_n = o(1/k'_n)$, then $R_n + R'_n = o(1/k_n)$. 

8.24 A sequence of random variables $Y_n$ is bounded in probability if given any $\varepsilon > 0$, there 
exist $M$ and $n_0$ such that $P(|Y_n| > M) < \varepsilon$ for all $n > n_0$. Show that if $Y_n$ converges in 
law, then $Y_n$ is bounded in probability. 

8.25 In generalization of the notation $o$ and $O$, let us say that $Y_n = o_p(1/k_n)$ if $k_n Y_n \to 0$ 
in probability and that $Y_n = O_p(1/k_n)$ if $k_n Y_n$ is bounded in probability. Show that the 
results of Problems 8.21 - 8.23 continue to hold if $o$ and $O$ are replaced by $o_p$ and $O_p$. 

8.26 Let $(X_n, Y_n)$ have a bivariate normal distribution with means $E(X_n) = E(Y_n) = 0$, 
variances $E(X_n^2) = E(Y_n^2) = 1$, and with correlation coefficient $\rho_n$ tending to 1 as 
$n \to \infty$. 

(a) Show that $(X_n, Y_n) \xrightarrow{\mathcal{L}} (X, Y)$ where $X$ is $N(0, 1)$ and $P(X = Y) = 1$. 

(b) If $S = \{(x, y) : x = y\}$, show that (8.25) does not hold. 

8.27 Prove Theorem 8.22. [Hint: Make a Taylor expansion as in the proof of Theorem 8.12 
and use Problem 4.16.] 

10 Notes 

10.1 Fubini’s Theorem 

Theorem 2.8, called variously Fubini’s or Tonelli’s theorem, is often useful in 
mathematical statistics. A variant of Theorem 2.8 allows $f$ to be nonpositive, but requires an 
integrability condition (Billingsley 1995, Section 18). Dudley (1989) refers to Theorem 
2.8 as the Tonelli-Fubini theorem and recounts an interesting history in which Lebesgue 
played a role. Apparently, Fubini’s first published proof of this theorem was incorrect 
and was later corrected by Tonelli, using results of Lebesgue. 

10.2 Sufficiency 

The concept of sufficiency is due to Fisher (1920). (For some related history, see Stigler 
1973.) In his fundamental paper of 1922, Fisher introduced the term sufficiency and 
stated the factorization criterion. The criterion was rediscovered by Neyman (1935) and 
was proved for general dominated families by Halmos and Savage (1949). The theory of 
minimal sufficiency was initiated by Lehmann and Scheffé (1950) and Dynkin (1951). 
Further generalizations are given by Bahadur (1954) and Landers and Rogge (1972). 
Yamada and Morimoto (1992) review the topic. Theorem 7.8 with squared error loss 
is due to Rao (1945) and Blackwell (1947). It was extended to the pth power of the 
error (p > 1) by Barankin (1950) and to arbitrary convex loss functions by Hodges and 
Lehmann (1950). 

10.3 Exponential Families 

One-parameter exponential families, as the only (regular) families of distributions for 
which there exists a one-dimensional sufficient statistic, were also introduced by Fisher 
(1934). His result was generalized to more than one dimension by Darmois (1935), 
Koopman (1936), and Pitman (1936). (Their contributions are compared by Barankin 
and Maitra (1963).) Another discussion of this theorem with reference to the literature is 
given, for example, by Hipp (1974). Comprehensive treatments of exponential families 
are provided by Barndorff-Nielsen (1978) and Brown (1986a); a more mathematical 
treatment is given in Hoffmann-Jørgensen (1994). Statistical aspects are emphasized in 
Johansen (1979). 

10.4 Ancillarity 

To illustrate his use of ancillary statistics, group families were introduced by Fisher 
(1934). (For more information on ancillarity, see Buehler 1982, or the review article by 
Lehmann and Scholz 1992.) 

Ancillary statistics, and more general notions of ancillarity, have played an important 
role in developing inference in both group families and curved exponential families, 
the latter having connections to the field of “small-sample asymptotics,” where it is 
shown how to obtain highly accurate asymptotic approximations, based on ancillaries 
and saddlepoints. 

For example, as curved exponential families are not of full rank, it is typical that a minimal 
sufficient statistic is not complete. One might hope that an $s$-dimensional sufficient 
statistic could be split into a $d$-dimensional sufficient piece and an $(s - d)$-dimensional 
ancillary piece. Although this cannot always be done, useful decompositions can be 
found. Such endeavors lie at the heart of conditional inference techniques. 

Good introductions to these topics can be found in Reid (1988), Field and Ronchetti 
(1990), Hinkley, Reid, and Snell (1991), Barndorff-Nielsen and Cox (1994), and Reid 
(1995). 

10.5 Completeness 

Completeness was introduced by Lehmann and Scheffé (1950). Theorem 6.21 is due to 
Basu (1955b, 1958). Although there is no converse to Basu’s theorem as stated here, 
some alternative definitions and converse results are discussed by Lehmann (1981). 
There are alternate versions of Theorem 6.22, which relate completeness in exponential 
families to having full rank. This is partially due to the fact that a full or full-rank 
exponential family can be defined in alternate ways. For example, referring to (5.1), if 
we define $\Theta$ as the index set of the densities $p_\theta(x)$, that is, we consider the family of 
densities $\{p_\theta(x), \theta \in \Theta\}$, then Brown (1986a, Section 1.1) defines the exponential family 
to be full if $\Theta = \Xi$, where $\Xi$ is the natural parameter space [see (5.3)]. But this property 
is not needed for completeness. As Brown (1986a, Theorem 2.12) states, as long as the 
interior of $\Theta$ is nonempty (that is, $\Theta$ contains an open set), the family $\{p_\theta(x), \theta \in \Theta\}$ is 
complete. Another definition of a full exponential model is given by Barndorff-Nielsen 
and Cox (1994, Section 1.3), which requires that the statistics $T_1, \ldots, T_s$ not be linearly 
dependent. 

In nonparametric families, the property of completeness, and determination of complete 
sufficient statistics, continues to be investigated. See, for example, Mandelbaum and 
Rüschendorf (1987) and Mattner (1992, 1993, 1994). For example, building on the work 
of Fraser (1954) and Mandelbaum and Rüschendorf (1987), Mattner (1994) showed that 
the order statistics are complete for the family of densities $\mathcal{P}$, in cases such as 

(a) $\mathcal{P}$ = {all probability measures on the real line with unimodal densities with respect 
to Lebesgue measure}. 

(b) $\mathcal{P} = \{(1 - t)P + tQ : P \in \mathcal{P}, Q \in \mathcal{Q}(P), t \in [0, \epsilon]\}$, where $\epsilon$ is fixed and, for each 
$P \in \mathcal{P}$, $P$ is absolutely continuous with respect to the complete and convex family 
$\mathcal{Q}(P)$. 

10.6 Curved Exponential Families 

The theory of curved exponential families was initiated by Efron (1975, 1978), who 
applied the ideas of plane curvature and arc length to better understand the structure of 
exponential families. Curved exponential families have been extensively studied since 
then. (See, for example, Brown 1986a, Chapter 3; Barndorff-Nielsen 1988; McCullagh 
and Nelder 1989; Barndorff-Nielsen and Cox 1994, Section 2.10.) Here, we give 
some details in a two-dimensional case; extensions to higher dimensions are reasonably 
straightforward (Problem 5.4). 

For the exponential family (5.1), with $s = 2$, the parameter is $(\eta_1(\theta), \eta_2(\theta))$, where $\theta$ 
is an underlying parameter which is indexing the parameter space. If $\theta$ itself is a 
one-dimensional parameter, then the parameter space is a curve in two dimensions, a subset 
of the full two-dimensional space. Assuming that the $\eta_i$’s have at least two derivatives 
as functions of $\theta$, the parameter space is a one-dimensional differentiable manifold, a 
differentiable curve. (See Amari et al 1987 or Murray and Rice 1993 for an introduction 
to differential geometry and statistics.) 


Figure 10.1. Normal curved exponential family. The curve $\eta(\tau) = (\eta_1(\tau), \eta_2(\tau)) = (\tau, -\frac{1}{2}\tau^2)$. 
The curvature $\gamma_\tau$ is the instantaneous rate of change of the angle $\Delta a$, between the 
derivatives $\nabla\eta(\tau)$, with respect to the arc length $\Delta s$. The vector $\nabla\eta(\tau)$, the tangent 
vector, and the unit normal vector $N(\tau) = [-\eta_2'(\tau), \eta_1'(\tau)]/[ds_\tau/d\tau]$ provide a moving 
frame of reference. 



Example 10.1 Curvature. For the exponential family (5.7) let $\tau = 1/\xi$, so the parameter 
space is the curve 

$$\eta(\tau) = (\tau, -\tfrac{1}{2}\tau^2),$$

as shown in Figure 10.1. The direction of the curve $\eta(\tau)$, at any point $\tau$, is measured by 
the derivative vector (the gradient) $\nabla\eta(\tau) = (\eta_1'(\tau), \eta_2'(\tau)) = (1, -\tau)$. At each $\tau$ we can 
assign an angular value 

$$a(\tau) = \text{polar angle of normalized gradient vector } \nabla\eta(\tau) = \text{polar angle of } \frac{(\eta_1', \eta_2')}{[(\eta_1')^2 + (\eta_2')^2]^{1/2}},$$

which measures how the curve “bends.” The curvature, $\gamma_\tau$, is a measure of the rate of 
change of this angle as a function of the arc length $s(\tau)$, where $s(\tau) = \int_0^\tau |\nabla\eta(t)| \, dt$. 
Thus 

(10.1)  $\gamma_\tau = \lim_{\delta\tau \to 0} \dfrac{a(\tau + \delta\tau) - a(\tau)}{s(\tau + \delta\tau) - s(\tau)} = \dfrac{da(\tau)}{ds(\tau)};$ 

see Figure 10.1. An application of calculus will show that 

(10.2)  $\gamma_\tau = \dfrac{\eta_1' \eta_2'' - \eta_2' \eta_1''}{[(\eta_1')^2 + (\eta_2')^2]^{3/2}},$ 

so for the exponential family (5.7), we have $\gamma_\tau = -(1 + \tau^2)^{-3/2}$. 


For the most part, we are only concerned with $|\gamma_\tau|$, as the sign merely gives the direction 
of parameterization, and the magnitude gives the degree of curvature. As might be 
expected, lines have zero curvature and circles have constant curvature. The curvature 
of a circle is equal to the reciprocal of the radius, which leads to calling $1/|\gamma_\tau|$ the 
radius of curvature. Definitions of arc length, and so forth, naturally extend beyond two 
dimensions. (See Problems 5.5 and 5.4.) 
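
The curvature of the curve $\eta(\tau) = (\tau, -\frac{1}{2}\tau^2)$ can be checked numerically. The sketch below is illustrative only (the function names and step size are assumptions): it evaluates the closed-form expression built from first and second derivatives and compares it with a finite-difference approximation of the rate of change of the tangent angle with respect to arc length.

```python
import math

def eta_prime(tau):
    """First derivative of eta(tau) = (tau, -tau**2 / 2)."""
    return (1.0, -tau)

def eta_double_prime(tau):
    """Second derivative of eta(tau)."""
    return (0.0, -1.0)

def curvature_formula(tau):
    """(eta1' eta2'' - eta2' eta1'') / (eta1'^2 + eta2'^2)^(3/2)."""
    (d1, d2), (dd1, dd2) = eta_prime(tau), eta_double_prime(tau)
    return (d1 * dd2 - d2 * dd1) / (d1 ** 2 + d2 ** 2) ** 1.5

def tangent_angle(tau):
    """Polar angle of the tangent vector at tau."""
    d1, d2 = eta_prime(tau)
    return math.atan2(d2, d1)

def curvature_numeric(tau, h=1e-5):
    """Central-difference estimate of d(angle)/d(arc length)."""
    speed = math.hypot(*eta_prime(tau))  # ds/dtau
    return (tangent_angle(tau + h) - tangent_angle(tau - h)) / (2 * h) / speed

if __name__ == "__main__":
    for tau in (0.0, 0.5, 2.0):
        print(tau, curvature_formula(tau), curvature_numeric(tau))
```

Both columns should agree with $-(1 + \tau^2)^{-3/2}$; the curve bends most sharply at $\tau = 0$ and flattens as $|\tau|$ grows.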

10.7 Large Deviation Theory 

Limit theorems such as Theorem 1.8.12 refer to sequences of situations as $n \to \infty$. 
However, in a given problem, one is dealing with a specific large value of $n$. Any 
particular situation can be embedded in many different sequences, which lead to different 
approximations. 

Suppose, for example, that it is desired to find an approximate value for 

(10.3)  $P(|T_n - g(\theta)| \ge a)$ 

when $n = 100$ and $a = 0.2$. If $\sqrt{n}[T_n - g(\theta)]$ is asymptotically normally distributed 
as $N(0, 1)$, one might want to put $a = c/\sqrt{n}$ (so that $c = 2$) and consider (10.3) as a 
member of the sequence 

(10.4)  $P\left(|T_n - g(\theta)| \ge \dfrac{2}{\sqrt{n}}\right) \approx 2[1 - \Phi(2)].$ 

Alternatively, one could keep $a = 0.2$ fixed and consider (10.3) as a member of the 
sequence 

(10.5)  $P(|T_n - g(\theta)| \ge 0.2).$ 

Since $T_n - g(\theta) \to 0$, this sequence of probabilities tends to zero, and in fact does so at 
a very fast rate. In this approach, the normal approximation is no longer useful (it only 
tells us that (10.5) $\to 0$ as $n \to \infty$). The study of the limiting behavior of sequences 
such as (10.5) is called large deviation theory. An exposition of large deviation theory 
is given by Bahadur (1971). Books on large deviation theory include those by Kester 
(1985) and Bucklew (1990). Much research has been done on this topic, and applications 
to various aspects of point estimation can be found in Fu (1982), Kester and Kallenberg 
(1986), Sieders and Dzhaparidze (1987), and Pfanzagl (1990). 

We would, of course, like to choose the approximation that comes closer to the true 
value. It seems plausible that for values of (10.3) not extremely close to 0 and for 
moderate sample sizes, (10.4) would tend to do better than that obtained from the sequence 
(10.5). Some numerical comparisons in the context of hypothesis testing can be found 
in Groeneboom and Oosterhoff (1981); other applications in testing are considered in 
Barron (1989). 
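
To make the two embeddings concrete, consider the special case, assumed here purely for illustration, in which $T_n - g(\theta)$ is exactly $N(0, 1/n)$. The sketch below (function names are hypothetical) computes the probability for $n = 100$, $a = 0.2$, which equals $2[1 - \Phi(2)]$, and then follows the fixed-$a$ sequence to show its exponential decay: in this normal case $-\frac{1}{n} \log P(|T_n - g(\theta)| \ge a)$ approaches $a^2/2$.

```python
import math

def normal_tail(z):
    """1 - Phi(z), computed with the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def two_sided(n, a=0.2):
    """P(|T_n - g(theta)| >= a) when T_n - g(theta) is exactly N(0, 1/n)."""
    return 2.0 * normal_tail(a * math.sqrt(n))

def decay_rate(n, a=0.2):
    """-(1/n) log P(|T_n - g(theta)| >= a); tends to a**2 / 2 in this model."""
    return -math.log(two_sided(n, a)) / n

if __name__ == "__main__":
    print(two_sided(100))  # equals 2 * (1 - Phi(2))
    for n in (100, 1000, 20000):
        print(n, two_sided(n), decay_rate(n))
```

For other families the decay rate is generally different from $a^2/2$, which is the kind of question large deviation theory addresses.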
CHAPTER 2 


Unbiasedness 


1 UMVU Estimators 

It was pointed out in Section 1.1 that estimators with uniformly minimum risk 
typically do not exist, and restricting attention to estimators showing some degree 
of impartiality was suggested as one way out of this difficulty. As a first such 
restriction, we shall study the condition of unbiasedness in the present chapter. 

Definition 1.1 An estimator $\delta(x)$ of $g(\theta)$ is unbiased if 

(1.1)  $E_\theta[\delta(X)] = g(\theta)$ for all $\theta \in \Omega$. 

When used repeatedly, an unbiased estimator in the long run will estimate the 
right value “on the average.” This is an attractive feature, but insistence on 
unbiasedness can lead to problems. To begin with, unbiased estimators of $g$ may not 
exist. 

Example 1.2 Nonexistence of unbiased estimator. Let $X$ be distributed according 
to the binomial distribution $b(p, n)$ and suppose that $g(p) = 1/p$. Then, 
unbiasedness of an estimator $\delta$ requires 

(1.2)  $\sum_{k=0}^{n} \delta(k) \binom{n}{k} p^k q^{n-k} = g(p)$ for all $0 < p < 1$. 

That no such $\delta$ exists can be seen, for example, from the fact that as $p \to 0$, the left 
side tends to $\delta(0)$ and the right side to $\infty$. Yet, estimators of $1/p$ exist which (for 
$n$ not too small) are close to $1/p$ with high probability. For example, since $X/n$ 
tends to be close to $p$, $n/X$ (with some adjustment when $X = 0$) will tend to be 
close to $1/p$. || 

If there exists an unbiased estimator of $g$, the estimand $g$ will be called 
U-estimable. (Some authors call such an estimand “estimable,” but this conveys the 
false impression that any $g$ not possessing this property cannot be accurately 
estimated.) Even when $g$ is U-estimable there is no guarantee that any of its unbiased 
estimators are desirable in other ways, and one may instead still prefer to use an 
estimator that does have some bias. On the other hand, a large bias is usually 
considered a drawback and special methods of bias reduction have been developed 
for such cases. 

Example 1.3 The jackknife. A general method for bias reduction was initiated 
by Quenouille (1949, 1956) and later named the jackknife by Tukey (1958). Let 
$T(x)$ be an estimator of a parameter $\tau(\theta)$ based on a sample $x = (x_1, \ldots, x_n)$ and 
satisfying $E[T(x)] = \tau(\theta) + O(\frac{1}{n})$. Define $x_{(-i)}$ to be the vector of sample values 
excluding $x_i$. Then, the jackknifed version of $T(x)$ is 

(1.3)  $T_J(x) = nT(x) - \dfrac{n - 1}{n} \sum_{i=1}^{n} T(x_{(-i)}).$ 

It can be shown that $E[T_J(x)] = \tau(\theta) + O(\frac{1}{n^2})$, so the bias has been reduced (Stuart 
and Ord 1991, Section 17.10; see also Problem 1.4). || 
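
A minimal sketch of (1.3) follows. The function names and the data are hypothetical, and the choice of statistic is an assumption made for illustration: $T$ is the variance estimator with divisor $n$, whose bias is of order $1/n$. For this particular $T$ the jackknifed version coincides with the usual estimator with divisor $n - 1$, so the bias is removed entirely rather than merely reduced.

```python
def jackknife(estimator, xs):
    """Jackknifed version n*T(x) - ((n-1)/n) * sum_i T(x_(-i)) of an estimator T."""
    n = len(xs)
    # T evaluated on each leave-one-out sample x_(-i).
    leave_one_out = [estimator(xs[:i] + xs[i + 1:]) for i in range(n)]
    return n * estimator(xs) - (n - 1) / n * sum(leave_one_out)

def biased_variance(xs):
    """Variance estimator with divisor n (biased of order 1/n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def unbiased_variance(xs):
    """Usual variance estimator with divisor n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
    print(biased_variance(data), jackknife(biased_variance, data), unbiased_variance(data))
```

The `estimator` argument can be any function of a list of observations, so the same helper applies to other statistics with bias of order $1/n$.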

Although unbiasedness is an attractive condition, after a best unbiased estimator 
has been found, its performance should be investigated and the possibility not ruled 
out that a slightly biased estimator with much smaller risk might exist (see, for 
example, Sections 5.5 and 5.6). 

The motive for introducing unbiasedness was the hope that within the class 
of unbiased estimators, there would exist an estimator with uniformly minimum 
risk. In the search for such an estimator, a natural approach is to minimize the 
risk for some particular value $\theta_0$ and then see whether the result is independent of 
$\theta_0$. To this end, the following obvious characterization of the totality of unbiased 
estimators is useful. 

Lemma 1.4 If $\delta_0$ is any unbiased estimator of $g(\theta)$, the totality of unbiased 
estimators is given by $\delta = \delta_0 - U$ where $U$ is any unbiased estimator of zero, that is, 
it satisfies 

$E_\theta(U) = 0$ for all $\theta \in \Omega$. 

To illustrate this approach, suppose the loss function is squared error. The risk 
of an unbiased estimator $\delta$ is then just the variance of $\delta$. Restricting attention to 
estimators $\delta_0$, $\delta$, and $U$ with finite variance, we have, if $\delta_0$ is unbiased, 

$$\mathrm{var}(\delta) = \mathrm{var}(\delta_0 - U) = E(\delta_0 - U)^2 - [g(\theta)]^2$$

so that the variance of $\delta$ is minimized by minimizing $E(\delta_0 - U)^2$. 

Example 1.5 Locally best unbiased estimation. Let $X$ take on the values $-1, 0, 
1, \ldots$ with probabilities (Problem 1.1) 

(1.4)  $P(X = -1) = p, \quad P(X = k) = q^2 p^k, \quad k = 0, 1, \ldots,$ 

where $0 < p < 1$ and $q = 1 - p$, and consider the problems of estimating (a) $p$ 
and (b) $q^2$. Simple unbiased estimators of $p$ and $q^2$ are, respectively, 

$$\delta_0 = \begin{cases} 1 & \text{if } X = -1 \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad \delta_1 = \begin{cases} 1 & \text{if } X = 0 \\ 0 & \text{otherwise.} \end{cases}$$

It is easily checked that $U$ is an unbiased estimator of zero if and only if [Problem 
1.1(b)] 

(1.5)  $U(k) = -kU(-1)$ for $k = 0, 1, \ldots$ 

or equivalently if $U(k) = ak$ for all $k = -1, 0, 1, \ldots$ and some $a$. The problem 
of determining the unbiased estimator which minimizes the variance at $p_0$ thus 
reduces to that of determining the value of $a$ which minimizes 

(1.6)  $\sum P(X = k)[\delta_i(k) - ak]^2.$ 

The minimizing values of $a$ are (Problem 1.2) 

$$a_0^* = -p_0 \Big/ \left[ p_0 + q_0^2 \sum_{k=1}^{\infty} k^2 p_0^k \right] \quad \text{and} \quad a_1^* = 0$$

in cases (a) and (b), respectively. Since $a_1^*$ does not depend on $p_0$, the estimator 
$\delta_1^* = \delta_1 - a_1^* X = \delta_1$ minimizes the variance among all unbiased estimators not 
only when $p = p_0$ but for all values of $p$. On the other hand, $\delta_0^* = \delta_0 - a_0^* X$ does 
depend on $p_0$, and it therefore only minimizes the variance at $p = p_0$. || 
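
The computations in Example 1.5 can be verified numerically. The sketch below is a hedged illustration, not part of the original example: the function names are hypothetical, the infinite sums are truncated at an assumed number of terms, and $p_0 = 0.3$ is an arbitrary choice. It checks that $X$ is an unbiased estimator of zero, that $\delta_0 - a_0^* X$ remains unbiased for $p$ at values other than $p_0$, and that its variance at $p_0$ is smaller than that of $\delta_0$.

```python
def pmf(k, p):
    """P(X = k) for k = -1, 0, 1, ... as in (1.4)."""
    q = 1.0 - p
    return p if k == -1 else q * q * p ** k

def expectation(func, p, terms=2000):
    """E[func(X)] with the infinite sum truncated after `terms` values of k."""
    return sum(func(k) * pmf(k, p) for k in range(-1, terms))

def variance(func, p, terms=2000):
    m = expectation(func, p, terms)
    return expectation(lambda k: (func(k) - m) ** 2, p, terms)

def a0_star(p0, terms=2000):
    """Minimizing a for case (a); the series sums to 2*p0/q0, so a0* = -q0/2."""
    q0 = 1.0 - p0
    second_moment = p0 + q0 ** 2 * sum(k * k * p0 ** k for k in range(1, terms))
    return -p0 / second_moment

def delta0(k):
    return 1.0 if k == -1 else 0.0

def delta0_star(p0):
    """Locally best unbiased estimator of p at p0: delta0 - a0* X."""
    a = a0_star(p0)
    return lambda k: delta0(k) - a * k

if __name__ == "__main__":
    p0 = 0.3
    best = delta0_star(p0)
    print(expectation(lambda k: k, p0))            # E[X], approximately 0
    print(expectation(best, 0.6))                  # still approximately 0.6
    print(variance(delta0, p0), variance(best, p0))
```

Under these assumptions the variance at $p_0$ drops from $p_0 q_0$ to $p_0 q_0 / 2$, while at other values of $p$ the estimator built for $p_0$ need not be the best one.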


The properties possessed by $\delta_0^*$ and $\delta_1^*$ are characterized more generally by the
following definition.

Definition 1.6 An unbiased estimator $\delta(x)$ of $g(\theta)$ is the uniform minimum variance
unbiased (UMVU) estimator of $g(\theta)$ if $\mathrm{var}_\theta \delta(x) \le \mathrm{var}_\theta \delta'(x)$ for all $\theta \in \Omega$,
where $\delta'(x)$ is any other unbiased estimator of $g(\theta)$. The estimator $\delta(x)$ is locally
minimum variance unbiased (LMVU) at $\theta = \theta_0$ if $\mathrm{var}_{\theta_0} \delta(x) \le \mathrm{var}_{\theta_0} \delta'(x)$ for any
other unbiased estimator $\delta'(x)$.

In terms of Definition 1.6, we have shown in Example 1.5 that $\delta_1^*$ is UMVU and
that $\delta_0^*$ is LMVU. Since $\delta_0^*$ depends on $p_0$, no UMVU estimator exists in this case.

Notice that the definition refers to “the” UMVU estimator, since UMVU estimators
are unique (see Problem 1.12). The existence, uniqueness, and characterization
of LMVU estimators have been investigated by Barankin (1949) and Stein (1950).
Interpreting $E(\delta_0 - U)^2$ as the distance between $\delta_0$ and $U$, the minimizing $U^*$
can be interpreted as the projection of $\delta_0$ onto the linear space $\mathcal{U}$ formed by the
unbiased estimators $U$ of zero. The desired results then follow from the projection
theorem of linear space theory (see, for example, Bahadur 1957, and Luenberger
1969).

The relationship of unbiased estimators of $g(\theta)$ with unbiased estimators of zero
can be helpful in characterizing and determining UMVU estimators when they
exist. Note that if $\delta(X)$ is an unbiased estimator of $g(\theta)$, then so is $\delta(X) + aU(X)$,
for any constant $a$ and any unbiased estimator $U$ of zero and that

$\mathrm{var}_{\theta_0}[\delta(X) + aU(X)] = \mathrm{var}_{\theta_0}\delta(X) + a^2\,\mathrm{var}_{\theta_0}U(X) + 2a\,\mathrm{cov}_{\theta_0}(U(X), \delta(X))$.

If $\mathrm{cov}_\theta(U(X), \delta(X)) \neq 0$ for some $\theta = \theta_0$, we shall show below that there exists a
value of $a$ for which $\mathrm{var}_{\theta_0}[\delta(X) + aU(X)] < \mathrm{var}_{\theta_0}\delta(X)$. As a result, the covariance
with unbiased estimators of zero is the key in characterizing the situations in which
a UMVU estimator exists. In the statement of the following theorem, attention
will be restricted to estimators with finite variance, since otherwise the problem of
minimizing the variance does not arise. The class of estimators $\delta$ with $E_\theta \delta^2 < \infty$
for all $\theta$ will be denoted by $\Delta$.

Theorem 1.7 Let $X$ have distribution $P_\theta$, $\theta \in \Omega$, let $\delta$ be an estimator in $\Delta$, and
let $\mathcal{U}$ denote the set of all unbiased estimators of zero which are in $\Delta$. Then, a
necessary and sufficient condition for $\delta$ to be a UMVU estimator of its expectation
$g(\theta)$ is that

(1.7) $E_\theta(\delta U) = 0$ for all $U \in \mathcal{U}$ and all $\theta \in \Omega$.

(Note: Since $E_\theta(U) = 0$ for all $U \in \mathcal{U}$, it follows that $E_\theta(\delta U) = \mathrm{cov}_\theta(\delta, U)$, so
that (1.7) is equivalent to the condition that $\delta$ is uncorrelated with every $U \in \mathcal{U}$.)

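Proof.

(a) Necessity. Suppose $\delta$ is UMVU for estimating its expectation $g(\theta)$. Fix $U \in \mathcal{U}$,
$\theta \in \Omega$, and for arbitrary real $\lambda$, let $\delta' = \delta + \lambda U$. Then, $\delta'$ is also an unbiased
estimator of $g(\theta)$, so that

$\mathrm{var}_\theta(\delta + \lambda U) \ge \mathrm{var}_\theta(\delta)$ for all $\lambda$.

Expanding the left side, we see that

$\lambda^2\,\mathrm{var}_\theta U + 2\lambda\,\mathrm{cov}_\theta(\delta, U) \ge 0$ for all $\lambda$,

a quadratic in $\lambda$ with real roots $\lambda = 0$ and $\lambda = -2\,\mathrm{cov}_\theta(\delta, U)/\mathrm{var}_\theta(U)$. It will
therefore take on negative values unless $\mathrm{cov}_\theta(\delta, U) = 0$.

(b) Sufficiency. Suppose $E_\theta(\delta U) = 0$ for all $U \in \mathcal{U}$. To show that $\delta$ is UMVU,
let $\delta'$ be any unbiased estimator of $E_\theta(\delta)$. If $\mathrm{var}_\theta \delta' = \infty$, there is nothing to
prove, so assume $\mathrm{var}_\theta \delta' < \infty$. Then, $\delta - \delta' \in \mathcal{U}$ (Problem 1.8) so that

$E_\theta[\delta(\delta - \delta')] = 0$

and hence $E_\theta(\delta^2) = E_\theta(\delta\delta')$. Since $\delta$ and $\delta'$ have the same expectation,

$\mathrm{var}_\theta \delta = \mathrm{cov}_\theta(\delta, \delta')$,

and from the covariance inequality (Problem 1.5), we conclude that $\mathrm{var}_\theta(\delta) \le \mathrm{var}_\theta(\delta')$.

□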

The proof of Theorem 1.7 shows that condition (1.7), if required only for $\theta = \theta_0$,
is necessary and sufficient for an estimator $\delta$ with $E_{\theta_0}(\delta^2) < \infty$ to be LMVU at
$\theta_0$. This result also follows from the characterization of the LMVU estimator as
$\delta = \delta_0 - U^*$ where $\delta_0$ is any unbiased estimator of $g$ and $U^*$ is the projection of
$\delta_0$ onto $\mathcal{U}$. Interpreting the equation $E_{\theta_0}(\delta U) = 0$ as orthogonality of $\delta$ and $U$,
the projection $U^*$ has the property that $\delta = \delta_0 - U^*$ is orthogonal to $\mathcal{U}$, that is,
$E_{\theta_0}(\delta U) = 0$ for all $U \in \mathcal{U}$. If the estimator is to be UMVU, this relation must
hold for all $\theta$.

Example 1.8 Continuation of Example 1.5. As an application of Theorem 1.7,
let us determine the totality of UMVU estimators in Example 1.5. In view of (1.5)
and (1.7), a necessary and sufficient condition for $\delta$ to be UMVU for its expectation
is

(1.8) $E_p(\delta X) = 0$ for all $p$,

that is, for $\delta X$ to be in $\mathcal{U}$ and hence to satisfy (1.5). This condition reduces to

$k\delta(k) = k\delta(-1)$ for $k = 0, 1, 2, \ldots$,

which is satisfied provided

(1.9) $\delta(k) = \delta(-1)$ for $k = 1, 2, \ldots$

with $\delta(0)$ being arbitrary. If we put $\delta(-1) = a$, $\delta(0) = b$, the expectation of such a $\delta$
is $g(p) = bq^2 + a(1 - q^2)$ and $g(p)$ is therefore seen to possess a UMVU estimator
with finite variance if and only if it is of the form $a + cq^2$. ||

It is interesting to note, although we shall not prove it here, that Theorem 1.7 
typically, but not always, holds not only for squared error but for general convex 
loss functions. This result follows from a theorem of Bahadur (1957). For details, 
see Padmanabhan (1970) and Linnik and Rukhin (1971). 

Constants are always UMVU estimators of their expectations since the variance 
of a constant is zero. (If 8 is a constant, (1.7) is of course trivially satisfied.) Deleting 
the constants from consideration, three possibilities remain concerning the set of 
UMVU estimators. 

Case 1. No nonconstant $U$-estimable function has a UMVU estimator.

Example 1.9 Nonexistence of UMVU estimator. Let $X_1, \ldots, X_n$ be a sample
from a discrete distribution which assigns probability 1/3 to each of the points
$\theta - 1$, $\theta$, $\theta + 1$, and let $\theta$ range over the integers. Then, no nonconstant function
of $\theta$ has a UMVU estimator (Problem 1.9). A continuous version of this example
is provided by a sample from the uniform distribution $U(\theta - 1/2, \theta + 1/2)$; see
Lehmann and Scheffé (1950, 1955, 1956). (For additional examples, see Section
2.3.) ||

Case 2. Some, but not all, nonconstant $U$-estimable functions have UMVU
estimators. Example 1.5 provides an instance of this possibility.

Case 3. Every $U$-estimable function has a UMVU estimator.

A condition for this to be the case is suggested by (Rao-Blackwell) Theorem
1.7.8. If $T$ is a sufficient statistic for the family $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ and $g(\theta)$ is
$U$-estimable, then any unbiased estimator $\delta$ of $g(\theta)$ which is not a function of $T$
is improved by its conditional expectation given $T$, say $\eta(T)$. Furthermore, $\eta(T)$
is again an unbiased estimator of $g(\theta)$ since by (6.1), $E_\theta[\eta(T)] = E_\theta[\delta(X)]$.

Lemma 1.10 Let $X$ be distributed according to a distribution from $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$,
and let $T$ be a complete sufficient statistic for $\mathcal{P}$. Then, every $U$-estimable
function $g(\theta)$ has one and only one unbiased estimator that is a function of $T$.
(Here, uniqueness, of course, means that any two such functions agree a.e. $\mathcal{P}$.)

Proof. That such an unbiased estimator exists was established just preceding the
statement of Lemma 1.10. If $\delta_1$ and $\delta_2$ are two unbiased estimators of $g(\theta)$, their
difference $f(T) = \delta_1(T) - \delta_2(T)$ satisfies

$E_\theta f(T) = 0$ for all $\theta \in \Omega$,

and hence by the completeness of $T$, $\delta_1(T) = \delta_2(T)$ a.e. $\mathcal{P}$, as was to be proved.

□

So far, attention has been restricted to squared error loss. However, the Rao- 
Blackwell theorem applies to any convex loss function, and the preceding argument 
therefore establishes the following result. 

Theorem 1.11 Let $X$ be distributed according to a distribution in $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$,
and suppose that $T$ is a complete sufficient statistic for $\mathcal{P}$.

(a) For every $U$-estimable function $g(\theta)$, there exists an unbiased estimator that
uniformly minimizes the risk for any loss function $L(\theta, d)$ which is convex in
its second argument; therefore, this estimator in particular is UMVU.

(b) The UMVU estimator of (a) is the unique unbiased estimator which is a function
of $T$; it is the unique unbiased estimator with minimum risk, provided its
risk is finite and $L$ is strictly convex in $d$.

It is interesting to note that under mild conditions, the existence of a complete 
sufficient statistic is not only sufficient but also necessary for Case 3. This result, 
which is due to Bahadur (1957), will not be proved here. 

Corollary 1.12 If $\mathcal{P}$ is an exponential family of full rank given by (5.1), then the
conclusions of Theorem 1.11 hold with $\theta = (\theta_1, \ldots, \theta_s)$ and $T = (T_1, \ldots, T_s)$.

Proof. This follows immediately from Theorem 1.6.22. □

Theorem 1.11 and its corollary provide best unbiased estimators for large classes 
of problems, some of which will be discussed in the next three sections. For the 
sake of simplicity, these estimators will be referred to as being UMVU, but it 
should be kept in mind that their optimality is not tied to squared error as loss, but, 
in fact, they minimize the risk for any convex loss function. 

Sometimes we happen to know an unbiased estimator $\delta$ of $g(\theta)$ which is a
function of a complete sufficient statistic. The theorem then states that $\delta$ is UMVU.
Suppose, for example, that $X_1, \ldots, X_n$ are iid according to $N(\xi, \sigma^2)$ and that the
estimand is $\sigma^2$. The standard unbiased estimator of $\sigma^2$ is then $\delta = \sum(X_i - \bar{X})^2/(n - 1)$.
Since this is a function of the complete sufficient statistic $T = (\sum X_i, \sum(X_i - \bar{X})^2)$,
$\delta$ is UMVU. Barring such fortunate accidents, two systematic methods are
available for deriving UMVU estimators through Theorem 1.11.


Method One: Solving for $\delta$

If $T$ is a complete sufficient statistic, the UMVU estimator of any $U$-estimable
function $g(\theta)$ is uniquely determined by the set of equations

(1.10) $E_\theta \delta(T) = g(\theta)$ for all $\theta \in \Omega$.

Example 1.13 Binomial UMVU estimator. Suppose that $T$ has the binomial
distribution $b(p, n)$ and that $g(p) = pq$. Then, (1.10) becomes

(1.11) $\sum_{t=0}^{n} \binom{n}{t} \delta(t) p^t q^{n-t} = pq$ for all $0 < p < 1$.

If $\rho = p/q$ so that $p = \rho/(1 + \rho)$ and $q = 1/(1 + \rho)$, (1.11) can be rewritten as

$\sum_{t=0}^{n} \binom{n}{t} \delta(t) \rho^t = \rho(1 + \rho)^{n-2} = \sum_{t=1}^{n-1} \binom{n-2}{t-1} \rho^t \quad (0 < \rho < \infty)$.

A comparison of the coefficients on the left and right sides leads to

$\delta(t) = \frac{t(n - t)}{n(n - 1)}$.
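The result of Method One can be verified directly by evaluating the left side of (1.11) in exact rational arithmetic. The sketch below is an illustration only; the helper names `delta` and `expectation` are assumptions introduced here and do not appear in the text.

```python
from fractions import Fraction
from math import comb


def delta(t, n):
    """Candidate UMVU estimator t(n - t) / [n(n - 1)] of p*q from Example 1.13."""
    return Fraction(t * (n - t), n * (n - 1))


def expectation(n, p):
    """Exact E_p[delta(T)] for T ~ b(p, n); p should be a Fraction."""
    q = 1 - p
    return sum(comb(n, t) * delta(t, n) * p**t * q ** (n - t) for t in range(n + 1))


if __name__ == "__main__":
    for n in (2, 5, 9):
        for p in (Fraction(1, 10), Fraction(1, 3), Fraction(7, 8)):
            print(n, p, expectation(n, p) == p * (1 - p))
```

Because the arithmetic is exact, equality with $pq$ holds without tolerance for each $n \ge 2$ and each rational $p$ tried.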


Method Two: Conditioning

If $\delta(X)$ is any unbiased estimator of $g(\theta)$, it follows from Theorem 1.11 that
the UMVU estimator can be obtained as the conditional expectation of $\delta(X)$ given
$T$. For this derivation, it does not matter which unbiased estimator $\delta$ is being
conditioned; one can thus choose $\delta$ so as to make the calculation of $\delta'(T) =
E[\delta(X)|T]$ as easy as possible.

Example 1.14 UMVU estimator for a uniform distribution. Suppose that $X_1, \ldots, X_n$
are iid according to the uniform distribution $U(0, \theta)$ and that $g(\theta) = \theta/2$.
Then, $T = X_{(n)}$, the largest of the $X$'s, is a complete sufficient statistic. Since
$E(X_1) = \theta/2$, the UMVU estimator of $\theta/2$ is $E[X_1 | X_{(n)} = t]$. If $X_{(n)} = t$, then
$X_1 = t$ with probability $1/n$, and $X_1$ is uniformly distributed on $(0, t)$ with the
remaining probability $(n - 1)/n$ (see Problem 1.6.2). Hence,

$E[X_1 | t] = \frac{1}{n} \cdot t + \frac{n - 1}{n} \cdot \frac{t}{2} = \frac{n + 1}{n} \cdot \frac{t}{2}$.

Thus, $[(n + 1)/n] \cdot T/2$ and $[(n + 1)/n] \cdot T$ are the UMVU estimators of $\theta/2$ and
$\theta$, respectively. ||
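A small simulation can make the conclusion of Example 1.14 concrete. The sketch below is not from the text: the function name `simulate`, the parameter values, and the choice of $2\bar{X}$ as an unbiased competitor are assumptions made for illustration. It compares the estimator $[(n + 1)/n] \cdot X_{(n)}$ of $\theta$ with $2\bar{X}$ over repeated samples.

```python
import random
from statistics import fmean, pvariance


def simulate(theta=2.0, n=5, reps=50_000, seed=0):
    """Return (mean, variance) of the UMVU estimator and of 2 * sample mean."""
    rng = random.Random(seed)
    umvu, competitor = [], []
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        umvu.append((n + 1) / n * max(xs))   # [(n + 1)/n] * X_(n)
        competitor.append(2.0 * fmean(xs))    # also unbiased for theta
    return fmean(umvu), pvariance(umvu), fmean(competitor), pvariance(competitor)


if __name__ == "__main__":
    print(simulate())
```

Both averages should be close to `theta`, while the empirical variance of the estimator based on $X_{(n)}$ should be visibly smaller, in line with its UMVU property.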


The existence of UMVU estimators under the assumptions of Theorem 1.11 was 
proved there for convex loss functions. That the situation tends to be very different 
without convexity of the loss is seen from the following results of Basu (1955a). 

Theorem 1.15 Let the loss function $L(\theta, d)$ for estimating $g(\theta)$ be bounded, say
$L(\theta, d) \le M$, and assume that $L[\theta, g(\theta)] = 0$ for all $\theta$, that is, the loss is zero when
the estimated value coincides with the true value. Suppose that $g$ is $U$-estimable
and let $\theta_0$ be an arbitrary value of $\theta$. Then, there exists a sequence of unbiased
estimators $\delta_n$ for which $R(\theta_0, \delta_n) \to 0$.

Proof. Since $g(\theta)$ is $U$-estimable, there exists an unbiased estimator $\delta(X)$. For any
$0 < \pi < 1$, let

$\delta'_\pi(x) = \begin{cases} g(\theta_0) & \text{with probability } 1 - \pi \\ \frac{1}{\pi}[\delta(x) - g(\theta_0)] + g(\theta_0) & \text{with probability } \pi. \end{cases}$

Then, $\delta'_\pi$ is unbiased for all $\pi$ and all $\theta$, since

$E_\theta(\delta'_\pi) = (1 - \pi)g(\theta_0) + \frac{\pi}{\pi}[g(\theta) - g(\theta_0)] + \pi g(\theta_0) = g(\theta)$.

The risk $R(\theta_0, \delta'_\pi)$ at $\theta_0$ is $(1 - \pi) \cdot 0$ plus $\pi$ times the expected loss of
$(1/\pi)[\delta(X) - g(\theta_0)] + g(\theta_0)$, so that

$R(\theta_0, \delta'_\pi) \le \pi M$.

As $\pi \to 0$, it is seen that $R(\theta_0, \delta'_\pi) \to 0$. □

This result implies that for bounded loss functions, no uniformly minimum-risk-
unbiased or even locally minimum-risk-unbiased estimator exists except in trivial
cases, since at each $\theta_0$, the risk can be made arbitrarily small even by unbiased
estimators. [Basu (1955a) proved this fact for a more general class of nonconvex
loss functions.] The proof lends support to the speculation of Section 1.7 that the
difficulty with nonconvex loss functions stems from the possibility of arbitrarily
large errors since as $\pi \to 0$, the error $|\delta'_\pi(x) - g(\theta_0)| \to \infty$. It is the leverage of
these large but relatively inexpensive errors which nullifies the restraining effect
of unbiasedness.

This argument applies not only to the limiting case of unbounded errors but
also, although to a correspondingly lesser degree, to the case of finite large errors.
In the latter situation, convex loss functions receive support from a large-sample
consideration. To fix ideas, suppose the observations consist of $n$ iid variables
$X_1, \ldots, X_n$. As $n$ increases, the error in estimating a given value $g(\theta)$ will decrease
and tend to zero as $n \to \infty$. (See Section 1.8 for a precise statement.) Thus,
essentially only the local behavior of the loss function near the true value $g(\theta)$ is
relevant. If the loss function is smooth, its Taylor expansion about $d = g(\theta)$ gives

$L(\theta, d) = a(\theta) + b(\theta)[d - g(\theta)] + c(\theta)[d - g(\theta)]^2 + R$,

where the remainder $R$ becomes negligible as the error $|d - g(\theta)|$ becomes
sufficiently small. If the loss is zero when $d = g(\theta)$, then $a$ must be zero, so that
$b(\theta)[d - g(\theta)]$ becomes the dominating term for small errors. The condition
$L(\theta, d) \ge 0$ for all $\theta$ then implies $b(\theta) = 0$ and hence

$L(\theta, d) = c(\theta)[d - g(\theta)]^2 + R$.

Minimizing the risk for large $n$ thus becomes essentially equivalent to minimizing
$E[\delta(X) - g(\theta)]^2$, which justifies not only a convex loss function but even squared
error. Not only the loss function but also other important aspects of the behavior
of estimators and the comparison of different estimators greatly simplify for large
samples, as will be discussed in Chapter 6.

The difficulty which bounded loss functions present for the theory of unbiased
estimation is not encountered by a different unbiasedness concept, that of median
unbiasedness mentioned in Section 1.1. For estimating $g(\theta)$ in a multiparameter
exponential family, it turns out that uniformly minimum risk median unbiased
estimators exist for any loss function $L$ for which $L(\theta, d)$ is a nondecreasing
function of $d$ as $d$ moves in either direction away from $g(\theta)$. A detailed version
of this result can be found in Pfanzagl (1979). We shall not discuss the theory of
median unbiased estimation here since the methods required belong to the theory
of confidence intervals rather than that of point estimation (see TSH2, Section 3.5).

2 Continuous One- and Two-Sample Problems

The problem of estimating an unknown quantity $\theta$ from $n$ measurements of $\theta$
was considered in Example 1.1.1 as the prototype of an estimation problem. It
was formalized by assuming that the $n$ measurements are iid random variables
$X_1, \ldots, X_n$ with common distribution belonging to the location family

(2.1) $P_\theta(X_i \le x) = F(x - \theta)$.


The problem takes different forms according to the assumptions made about $F$.
Some possibilities are the following:

(a) $F$ is completely specified.

(b) $F$ is specified except for an unknown scale parameter. In this case, (2.1) will
be replaced by a location-scale family. It will then be convenient to denote the
location parameter by $\xi$ rather than $\theta$ (to reserve $\theta$ for the totality of unknown
parameters) and hence to write the family as

(2.2) $P_\theta(X_i \le x) = F\left(\frac{x - \xi}{\sigma}\right)$.

Here, it will be of interest to estimate both $\xi$ and $\sigma$.

(c) The distribution of the $X$'s is only approximately given by Equation (2.1) or
(2.2) with a specified $F$. What is meant by “approximately” leads to the topic
of robust estimation.

(d) $F$ is known to be symmetric about 0 (so that the $X$'s are symmetrically
distributed about $\theta$ or $\xi$) but is otherwise unknown.

(e) $F$ is unknown except that it has finite variance; the estimand is $\xi = E(X_i)$.

In all these models, $F$ is assumed to be continuous.

A treatment of Problems (a) and (b) for an arbitrary known $F$ is given in Chapter 3
from the point of view of equivariance. In the present section, we shall be concerned
with unbiased estimation of $\theta$ or $(\xi, \sigma)$ in Problems (a) and (b) and some of their
generalizations for some special distributions, particularly for the case that $F$ is
normal or exponential. Problems (c), (d), and (e) all fall under the general heading
of robust and nonparametric statistics (Huber 1981, Hampel et al. 1986, Staudte
and Sheather 1990). We will not attempt a systematic treatment of these topics
here, but will touch upon some points through examples. For example, Problem
(e) will be considered in Section 2.4.

The following three examples will be concerned with the normal one-sample
problems, that is, with estimation problems arising when $X_1, \ldots, X_n$ are
distributed with joint density (2.3).

Example 2.1 Estimating polynomials of a normal variance. Let $X_1, \ldots, X_n$ be
distributed with joint density

(2.3) $\frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \xi)^2\right]$

and assume, to begin with, that only one of the parameters is unknown. If $\sigma$ is
known, it follows from Theorem 1.6.22 that the sample mean $\bar{X}$ is a complete
sufficient statistic, and since $E(\bar{X}) = \xi$, $\bar{X}$ is the UMVU estimator of $\xi$. More
generally, if $g(\xi)$ is any $U$-estimable function of $\xi$, there exists a unique unbiased
estimator $\delta(\bar{X})$ based on $\bar{X}$ and it is UMVU. If, in particular, $g(\xi)$ is a polynomial
of degree $r$, $\delta(\bar{X})$ will also be a polynomial of that degree, which can be determined
inductively for $r = 2, 3, \ldots$ (Problem 2.1).

If $\xi$ is known, (2.3) is a one-parameter exponential family with $S^2 = \sum(X_i - \xi)^2$
being a complete sufficient statistic. Since $Y = S^2/\sigma^2$ is distributed as $\chi^2_n$
independently of $\sigma^2$, it follows that

$E\left(\frac{S^r}{\sigma^r}\right) = \frac{1}{K_{n,r}}$,

where $K_{n,r}$ is a constant, and hence that

(2.4) $K_{n,r} S^r$

is UMVU for $\sigma^r$. Recall from Example 1.5.14 with $a = n/2$, $b = 2$ and with $r/2$
in place of $r$ that

$E\left(\frac{S^r}{\sigma^r}\right) = E\left[(\chi^2_n)^{r/2}\right] = \frac{\Gamma[(n + r)/2]\, 2^{r/2}}{\Gamma(n/2)}$

so that

(2.5) $K_{n,r} = \frac{\Gamma(n/2)}{2^{r/2}\, \Gamma[(n + r)/2]}$.

As a check, note that for $r = 2$, $K_{n,r} = 1/n$, and hence $E(S^2) = n\sigma^2$.

Formula (2.5) is established in Example 1.5.14 only for $r > 0$. It is, however,
easy to see (Problem 1.5.19) that it holds whenever

(2.6) $n > -r$,

but that the $(r/2)$th moment of $\chi^2_n$ does not exist when $n \le -r$.
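The constant (2.5) is straightforward to evaluate numerically. The following minimal sketch assumes a helper named `K` (a name chosen here for illustration) and works on the log scale with `math.lgamma` so that large $n$ does not overflow; it enforces the existence condition (2.6).

```python
from math import exp, lgamma, log


def K(n, r):
    """K_{n,r} = Gamma(n/2) / (2^{r/2} Gamma((n + r)/2)), defined for n > -r as in (2.6)."""
    if n + r <= 0:
        raise ValueError("K_{n,r} requires n > -r")
    return exp(lgamma(n / 2) - (r / 2) * log(2) - lgamma((n + r) / 2))


if __name__ == "__main__":
    print(K(10, 2))   # the check in the text: K_{n,2} = 1/n
    print(K(10, 4))   # equals 1 / (n (n + 2)) by the recursion of the gamma function
    print(K(10, -1))  # constant used when estimating 1/sigma
```

With `r = 2` the value reduces to $1/n$, matching the check $E(S^2) = n\sigma^2$ in the text.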

We are now in a position to consider the more realistic case in which both
parameters are unknown. Then, by Example 1.6.24, $\bar{X}$ and $S^2 = \sum(X_i - \bar{X})^2$
jointly are complete sufficient statistics for $(\xi, \sigma^2)$. This shows that $\bar{X}$ continues
to be UMVU for $\xi$. Since $\mathrm{var}(\bar{X}) = \sigma^2/n$, estimation of $\sigma^2$ is, of course, also of
great importance. Now, $S^2/\sigma^2$ is distributed as $\chi^2_{n-1}$ and it follows from (2.4) with
$n$ replaced by $n - 1$ and the new definition of $S^2$ that

(2.7) $K_{n-1,r} S^r$

is UMVU for $\sigma^r$ provided $n > -r + 1$, and thus in particular $S^2/(n - 1)$ is UMVU
for $\sigma^2$.

Sometimes, it is of interest to measure $\xi$ in $\sigma$-units and hence to estimate
$g(\xi, \sigma) = \xi/\sigma$. Now $\bar{X}$ is UMVU for $\xi$ and $K_{n-1,-1}/S$ for $1/\sigma$. Since $\bar{X}$ and
$S$ are independent, it follows that $K_{n-1,-1}\bar{X}/S$ is unbiased for $\xi/\sigma$ and hence
UMVU, provided $n - 1 > 1$, that is, $n > 2$.

If we next consider calculating the variance of $K_{n-1,-1}\bar{X}/S$ or, more generally,
calculating the variance of UMVU estimators of polynomial functions of $\xi$ and $\sigma$,
we are led to calculating the moments $E(\bar{X}^k)$ and $E(S^k)$ for all $k = 1, 2, \ldots$. This
is investigated in Problems 2.4-2.6. ||

Another class of problems within the framework of the normal one-sample
problem relates to the probability

(2.8) $p = P(X_1 \le u)$.

Example 2.2 Estimating a probability or a critical value. Suppose that the
observations $X_i$ denote the performances of past candidates on an entrance
examination and that we wish to estimate the cutoff value $u$ for which the probability
of a passing performance, $X \ge u$, has a preassigned probability $1 - p$. This is the
problem of estimating $u$ in (2.8) for a given value of $p$. Solving the equation

(2.9) $p = P(X_1 \le u) = \Phi\left(\frac{u - \xi}{\sigma}\right)$

(where $\Phi$ denotes the cdf of the standard normal distribution) for $u$ shows that

$u = g(\xi, \sigma) = \xi + \sigma\Phi^{-1}(p)$.

It follows that the UMVU estimator of $u$ is

(2.10) $\bar{X} + K_{n-1,1} S \Phi^{-1}(p)$.
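Formula (2.10) can be turned into a short computation. The sketch below is an illustration under stated assumptions: the function names `K` and `umvu_cutoff` and the sample values are hypothetical, $S$ is taken as the square root of the sum of squared deviations as defined in the text, and $\Phi^{-1}$ is obtained from the standard library's `NormalDist.inv_cdf`.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist, fmean


def K(n, r):
    """Constant K_{n,r} of (2.5), valid for n > -r."""
    return exp(lgamma(n / 2) - (r / 2) * log(2) - lgamma((n + r) / 2))


def umvu_cutoff(xs, p):
    """Estimate u with P(X_1 <= u) = p via (2.10): Xbar + K_{n-1,1} * S * Phi^{-1}(p)."""
    n = len(xs)
    xbar = fmean(xs)
    s = sqrt(sum((x - xbar) ** 2 for x in xs))  # S, not the sample standard deviation
    return xbar + K(n - 1, 1) * s * NormalDist().inv_cdf(p)


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
    print(umvu_cutoff(scores, 0.5), umvu_cutoff(scores, 0.9))
```

Note that $K_{n-1,1} S$ is slightly larger than the usual sample standard deviation $S/\sqrt{n - 1}$, since the latter underestimates $\sigma$ on average.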

Consider next the problem of estimating $p$ for a given value of $u$. Suppose, for
example, that a manufactured item is acceptable if some quality characteristic is
$\le u$ and that we wish to estimate the probability of an item being acceptable, its
reliability, given by (2.9).

To illustrate a method which is applicable to many problems of this type, consider,
first, the simpler case that $\sigma = 1$. An unbiased estimator $\delta$ of $p$ is the indicator
of the event $X_1 \le u$. Since $\bar{X}$ is a complete sufficient statistic, the UMVU estimator
of $p = P(X_1 \le u) = \Phi(u - \xi)$ is therefore

$E[\delta | \bar{X}] = P[X_1 \le u | \bar{X}]$.

To evaluate this probability, use the fact that $X_1 - \bar{X}$ is independent of $\bar{X}$. This
follows from Basu’s theorem (Theorem 1.6.21) since $X_1 - \bar{X}$ is ancillary.¹ Hence,

$P[X_1 \le u | \bar{x}] = P[X_1 - \bar{X} \le u - \bar{x} | \bar{x}] = P[X_1 - \bar{X} \le u - \bar{x}]$,

and the computation of a conditional probability has been replaced by that of an
unconditional one. Now, $X_1 - \bar{X}$ is distributed as $N(0, (n - 1)/n)$, so that

(2.11) $P[X_1 - \bar{X} \le u - \bar{x}] = \Phi\left[\sqrt{\frac{n}{n - 1}}\,(u - \bar{x})\right]$

which is the UMVU estimator of $p$.
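Unbiasedness of (2.11) can be checked numerically by integrating the estimator against the $N(\xi, 1/n)$ distribution of $\bar{X}$. The sketch below is only an illustration: the names `umvu_prob` and `expected_value`, the grid size, and the integration width are assumptions, and a plain trapezoidal rule is used rather than any particular quadrature routine.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def umvu_prob(xbar, u, n):
    """Estimator (2.11) of P(X_1 <= u) when sigma = 1."""
    return STD.cdf(sqrt(n / (n - 1)) * (u - xbar))


def expected_value(xi, u, n, grid=4001, width=8.0):
    """E_xi[umvu_prob(Xbar)] for Xbar ~ N(xi, 1/n), by the trapezoidal rule."""
    sd = 1.0 / sqrt(n)
    dist = NormalDist(xi, sd)
    lo, hi = xi - width * sd, xi + width * sd
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = lo + i * h
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * umvu_prob(x, u, n) * dist.pdf(x)
    return total * h


if __name__ == "__main__":
    xi, u, n = 0.3, 1.0, 5
    print(expected_value(xi, u, n), STD.cdf(u - xi))
```

The two printed values should agree to several decimal places, reflecting $E_\xi\,\Phi[\sqrt{n/(n-1)}\,(u - \bar{X})] = \Phi(u - \xi)$.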

¹ Such applications of Basu’s theorem can be simplified when invariance is present. The theory and
some interesting illustrations are discussed by Eaton and Morris (1970).

Closely related to the problem of estimating $p$, which is the cdf

$F(u) = P[X_1 \le u] = \Phi(u - \xi)$

of $X_1$ evaluated at $u$, is that of estimating the probability density at $u$: $g(\xi) =
\phi(u - \xi)$. We shall now show that the UMVU estimator of the probability density
$g(\xi) = p_\xi^{X_1}(u)$ of $X_1$ evaluated at $u$ is the conditional density of $X_1$ given $\bar{X}$
evaluated at $u$, $\delta(\bar{X}) = p^{X_1|\bar{X}}(u)$. Since this is a function of $\bar{X}$, it is only necessary
to check that $\delta$ is unbiased. This can be shown by differentiating the UMVU
estimator of the cdf after justifying the required interchange of differentiation and
integration, or as follows. Note that the joint density of $X_1$ and $\bar{X}$ is $p^{X_1|\bar{x}}(u)\, p_\xi^{\bar{X}}(\bar{x})$
and that the marginal density is therefore

$p_\xi^{X_1}(u) = \int_{-\infty}^{\infty} p^{X_1|\bar{x}}(u)\, p_\xi^{\bar{X}}(\bar{x})\, d\bar{x}$.

This equation states just that $\delta(\bar{X})$ is an unbiased estimator of $g(\xi)$. Differentiating
the earlier equation

$P[X_1 \le u | \bar{x}] = \Phi\left[\sqrt{\frac{n}{n - 1}}\,(u - \bar{x})\right]$

with respect to $u$, we see that the derivative

$\frac{d}{du} P[X_1 \le u | \bar{X}] = p^{X_1|\bar{X}}(u) = \sqrt{\frac{n}{n - 1}}\; \phi\left[\sqrt{\frac{n}{n - 1}}\,(u - \bar{X})\right]$

(where $\phi$ is the standard normal density) is the UMVU estimator of $p_\xi^{X_1}(u)$.

Suppose now that both $\xi$ and $\sigma$ are unknown. Then, exactly as in the case
$\sigma = 1$, the UMVU estimator of $P[X_1 \le u] = \Phi((u - \xi)/\sigma)$ and of the density
$p^{X_1}(u) = (1/\sigma)\phi((u - \xi)/\sigma)$ is given, respectively, by $P[X_1 \le u | \bar{X}, S]$ and the
conditional density of $X_1$ given $\bar{X}$ and $S$ evaluated at $u$, where $S^2 = \sum(X_i - \bar{X})^2$. To
replace the conditional distribution with an unconditional one, note that $(X_1 - \bar{X})/S$
is ancillary and therefore, by Basu’s theorem, independent of $(\bar{X}, S)$. It follows, as
in the earlier case, that

(2.12) $P[X_1 \le u | \bar{x}, s] = P\left[\frac{X_1 - \bar{X}}{S} \le \frac{u - \bar{x}}{s}\right]$

and that

(2.13) $p^{X_1|\bar{x},s}(u) = \frac{1}{s}\, f\left(\frac{u - \bar{x}}{s}\right)$

where $f$ is the density of $(X_1 - \bar{X})/S$. A straightforward calculation (Problem
2.10) gives

(2.14) $f(z) = \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{n-2}{2}\right)} \sqrt{\frac{n}{n - 1}} \left(1 - \frac{nz^2}{n - 1}\right)^{(n/2) - 2}$ if $0 < |z| < \sqrt{\frac{n - 1}{n}}$

and zero elsewhere. The estimator (2.13) is obtained by substitution of (2.14), and
the estimator (2.12) is obtained by integrating the density $f$. ||
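The last step, obtaining (2.12) by integrating the density $f$ of (2.14), can be sketched numerically. The code below is an illustration under assumptions: the names `f` and `umvu_prob`, the midpoint rule, and the step count are choices made here, and the numerical integration is intended for $n \ge 4$, where the density is bounded.

```python
from math import gamma, sqrt


def f(z, n):
    """Density (2.14) of (X_1 - Xbar)/S, zero outside |z| < sqrt((n - 1)/n)."""
    bound = sqrt((n - 1) / n)
    if abs(z) >= bound:
        return 0.0
    const = gamma((n - 1) / 2) / (gamma(0.5) * gamma((n - 2) / 2)) * sqrt(n / (n - 1))
    return const * (1.0 - n * z * z / (n - 1)) ** (n / 2 - 2)


def umvu_prob(xs, u, steps=20_000):
    """Estimator (2.12): integral of f from the lower support limit to (u - xbar)/s."""
    n = len(xs)
    xbar = sum(xs) / n
    s = sqrt(sum((x - xbar) ** 2 for x in xs))
    bound = sqrt((n - 1) / n)
    upper = max(-bound, min(bound, (u - xbar) / s))  # clip to the support of f
    h = (upper + bound) / steps
    # Midpoint rule; every evaluation point lies strictly inside the support.
    return sum(f(-bound + (i + 0.5) * h, n) for i in range(steps)) * h


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical sample
    print(umvu_prob(data, 3.5), umvu_prob(data, 4.5), umvu_prob(data, 100.0))
```

Because $f$ has bounded support, the estimate is exactly 0 or 1 once $u$ lies far enough from $\bar{x}$ relative to $s$, unlike the plug-in estimate $\Phi((u - \bar{x})/\hat{\sigma})$.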


We shall next consider two extensions of the normal one-sample model. The
first extension is concerned with the two-sample problem, in which there are two
independent groups of observations, each with a model of this type, but
corresponding to different conditions or representing measurements of two different
quantities so that the parameters of the two models are not the same. The second
extension deals with the multivariate situation of $n$ $p$-tuples of observations
$(X_{1\nu}, \ldots, X_{p\nu})$, $\nu = 1, \ldots, n$, with $(X_{1\nu}, \ldots, X_{p\nu})$ representing measurements of
$p$ different characteristics of the $\nu$th subject.

Example 2.3 The normal two-sample problem. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$
be independently distributed according to normal distributions $N(\xi, \sigma^2)$ and
$N(\eta, \tau^2)$, respectively.

(a) Suppose that $\xi$, $\eta$, $\sigma$, $\tau$ are completely unknown. Then, the joint density

(2.15) $\frac{1}{(\sqrt{2\pi})^{m+n} \sigma^m \tau^n} \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \xi)^2 - \frac{1}{2\tau^2} \sum (y_j - \eta)^2\right]$

constitutes an exponential family for which the four statistics

$\bar{X}, \quad \bar{Y}, \quad S_X^2 = \sum (X_i - \bar{X})^2, \quad S_Y^2 = \sum (Y_j - \bar{Y})^2$

are sufficient and complete. The UMVU estimators of $\xi$ and $\sigma^r$ are therefore
$\bar{X}$ and $K_{m-1,r} S_X^r$, as in Example 2.1 (with the sample size $m$ of the $X$'s in place
of $n$), and those of $\eta$ and $\tau^r$ are given by
the corresponding formulas. In the present model, interest tends to focus on
comparing parameters from the two distributions. The UMVU estimator of
$\eta - \xi$ is $\bar{Y} - \bar{X}$ and that of $\tau^r/\sigma^r$ is the product of the UMVU estimators of
$\tau^r$ and $1/\sigma^r$.

(b) Sometimes, it is possible to assume that $\sigma = \tau$. Then $\bar{X}$, $\bar{Y}$, and $S^2 = \sum (X_i -
\bar{X})^2 + \sum (Y_j - \bar{Y})^2$ are complete sufficient statistics [Problem 1.6.35(a)] and
the natural unbiased estimators of $\sigma^r$, $\eta - \xi$, and $(\eta - \xi)/\sigma$ are all UMVU
(Problem 2.11).

(c) As a third possibility, suppose that $\eta = \xi$ but that $\sigma$ and $\tau$ are not known to be
equal, and that it is desired to estimate the common mean $\xi$. This might arise,
for example, when two independent sets of measurements of the same quantity
are available. The statistics $T = (\bar{X}, \bar{Y}, S_X^2, S_Y^2)$ are then minimal sufficient
(Problem 1.6.7), but they are no longer complete since $E(\bar{Y} - \bar{X}) = 0$.

If $\sigma^2/\tau^2 = \gamma$ is known, the best unbiased linear combination of $\bar{X}$ and $\bar{Y}$ is

$\delta_\gamma = \alpha\bar{X} + (1 - \alpha)\bar{Y}$, where $\alpha = \frac{\tau^2}{n} \Big/ \left(\frac{\sigma^2}{m} + \frac{\tau^2}{n}\right)$

(Problem 2.12). Since, in this case, $T' = (\sum X_i^2 + \gamma \sum Y_j^2, \sum X_i + \gamma \sum Y_j)$ is a complete
sufficient statistic (Problem 2.12) and $\delta_\gamma$ is a function of $T'$, $\delta_\gamma$ is UMVU. When
$\sigma^2/\tau^2$ is unknown, a UMVU estimator of $\xi$ does not exist (Problem 2.13), but
one can first estimate $\alpha$, and then estimate $\xi$ by $\hat{\xi} = \hat{\alpha}\bar{X} + (1 - \hat{\alpha})\bar{Y}$. It is easy
to see that $\hat{\xi}$ is unbiased provided $\hat{\alpha}$ is a function of only $S_X^2$ and $S_Y^2$ (Problem
2.13), for example, if $\sigma^2$ and $\tau^2$ in $\alpha$ are replaced by $S_X^2/(m - 1)$ and $S_Y^2/(n - 1)$.
The problem of finding a good estimator of $\xi$ has been considered by various
authors, among them Graybill and Deal (1959), Hogg (1960), Seshadri (1963),
Zacks (1966), Brown and Cohen (1974), Cohen and Sackrowitz (1974), Rubin
and Weisberg (1975), Rao (1980), Berry (1987), Kubokawa (1987), Loh (1991),
and George (1991). ||
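The plug-in estimator of the common mean described in case (c) is easy to compute. The sketch below is an illustration only: the function name `common_mean_estimate` and the data are hypothetical, and `statistics.variance` is used because it returns the sum of squared deviations divided by one less than the sample size, that is, $S_X^2/(m - 1)$ and $S_Y^2/(n - 1)$.

```python
from statistics import fmean, variance


def common_mean_estimate(xs, ys):
    """Estimate xi by alpha_hat * Xbar + (1 - alpha_hat) * Ybar with plug-in variances."""
    m, n = len(xs), len(ys)
    s2x = variance(xs)  # S_X^2 / (m - 1), replaces sigma^2 in alpha
    s2y = variance(ys)  # S_Y^2 / (n - 1), replaces tau^2 in alpha
    alpha_hat = (s2y / n) / (s2x / m + s2y / n)
    return alpha_hat * fmean(xs) + (1.0 - alpha_hat) * fmean(ys)


if __name__ == "__main__":
    noisy = [0.0, 2.0, 4.0]          # hypothetical imprecise measurements
    precise = [2.9, 3.0, 3.1, 3.0]   # hypothetical precise measurements
    print(common_mean_estimate(noisy, precise))
```

Since `alpha_hat` depends on the data only through $S_X^2$ and $S_Y^2$, the estimate satisfies the unbiasedness condition stated in the text; the weight shifts toward the sample with the smaller estimated variance of its mean.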
It is interesting to note that the nonexistence of a UMVU estimator holds not
only for $\xi$ but for any $U$-estimable function of $\xi$. This fact, for which no easy proof
is available, was established by Unni (1978, 1981) using the results of Kagan and
Palamadov (1968).

In cases (a) and (b), the difference $\eta - \xi$ provides one comparison between the
distributions of the $X$'s and $Y$'s. An alternative measure of the superiority (if large
values of the variables are desirable) of the $Y$'s over the $X$'s is the probability
$p = P(X < Y)$. The UMVU estimator of $p$ can be obtained as in Example
2.2 as $P(X_1 < Y_1 | \bar{X}, \bar{Y}, S_X^2, S_Y^2)$ and $P(X_1 < Y_1 | \bar{X}, \bar{Y}, S^2)$ in cases (a) and (b),
respectively (Problem 2.14). In case (c), the problem disappears since then $p = 1/2$.


Example 2.4 The multivariate normal one-sample problem. Suppose that
$(X_i, Y_i, \ldots)$, $i = 1, \ldots, n$, are observations of $p$ characteristics on a random sample of
$n$ subjects from a large population, so that the $n$ $p$-vectors can be assumed to be
iid. We shall consider the case that their common distribution is a $p$-variate normal
distribution (Example 1.4.5) and begin with the case $p = 2$.

The joint probability density of the $(X_i, Y_i)$ is then

(2.16) $\left(\frac{1}{2\pi\sigma\tau\sqrt{1 - \rho^2}}\right)^n \exp\left\{-\frac{1}{2(1 - \rho^2)} \left[\frac{1}{\sigma^2} \sum (x_i - \xi)^2 - \frac{2\rho}{\sigma\tau} \sum (x_i - \xi)(y_i - \eta) + \frac{1}{\tau^2} \sum (y_i - \eta)^2\right]\right\}$

where $E(X_i) = \xi$, $E(Y_i) = \eta$, $\mathrm{var}(X_i) = \sigma^2$, $\mathrm{var}(Y_i) = \tau^2$, and $\mathrm{cov}(X_i, Y_i) = \rho\sigma\tau$,
so that $\rho$ is the correlation coefficient between $X_i$ and $Y_i$. The bivariate family
(2.16) constitutes a five-parameter exponential family of full rank, and the set of
sufficient statistics $T = (\bar{X}, \bar{Y}, S_X^2, S_Y^2, S_{XY})$ where

(2.17) $S_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y})$

is therefore complete. Since the marginal distributions of the $X_i$ and $Y_i$ are $N(\xi, \sigma^2)$
and $N(\eta, \tau^2)$, the UMVU estimators of $\xi$ and $\sigma^2$ are $\bar{X}$ and $S_X^2/(n - 1)$, and those
of $\eta$ and $\tau^2$ are given by the corresponding formulas. The statistic $S_{XY}/(n - 1)$ is an
unbiased estimator of $\rho\sigma\tau$ (Problem 2.15) and is therefore the UMVU estimator
of $\mathrm{cov}(X_i, Y_i)$.

For the correlation coefficient $\rho$, the natural estimator is the sample correlation
coefficient

(2.18) $R = S_{XY} \big/ \sqrt{S_X^2 S_Y^2}$.

However, $R$ is not unbiased, since it can be shown [see, for example, Stuart and
Ord (1987, Section 16.32)] that

(2.19) $E(R) = \rho\left[1 - \frac{(1 - \rho^2)}{2n} + O\left(\frac{1}{n^2}\right)\right]$.

By implementing Method One of Section 2.1, together with some results from the
theory of Laplace transforms, Olkin and Pratt (1958) derived a function $G(R)$ of
$R$ which is unbiased and hence UMVU. It is given by

$G(r) = r F\left(\frac{1}{2}, \frac{1}{2}; \frac{n - 1}{2}; 1 - r^2\right)$

where $F(a, b; c; x)$ is the hypergeometric function

$F(a, b; c; x) = \sum_{k=0}^{\infty} \frac{\Gamma(a + k)\Gamma(b + k)\Gamma(c)\, x^k}{\Gamma(a)\Gamma(b)\Gamma(c + k)\, k!} = \frac{\Gamma(c)}{\Gamma(b)\Gamma(c - b)} \int_0^1 \frac{t^{b-1}(1 - t)^{c-b-1}}{(1 - tx)^a}\, dt$.

Calculation of $G(r)$ is facilitated by using a computer algebra program. Alternatively,
by substituting in the above series expansion, one can derive the approximation

$G(r) = r\left[1 + \frac{1 - r^2}{2(n - 1)} + O\left(\frac{1}{n^2}\right)\right]$

which is quite accurate.
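The series expansion of $F(a, b; c; x)$ also lends itself to direct numerical summation. The sketch below is an illustration under assumptions: the function names `hyp2f1`, `G`, and `G_approx` are chosen here, the series is summed term by term using the ratio of successive terms, and the stopping rule is a simple relative tolerance. It is meant for moderate $n$, where the series converges quickly for $0 \le x \le 1$.

```python
def hyp2f1(a, b, c, x, tol=1e-15, max_terms=100_000):
    """Sum the hypergeometric series; term_{k+1}/term_k = (a+k)(b+k) x / ((c+k)(k+1))."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def G(r, n):
    """G(r) = r F(1/2, 1/2; (n - 1)/2; 1 - r^2), following the formula in the text."""
    return r * hyp2f1(0.5, 0.5, (n - 1) / 2, 1.0 - r * r)


def G_approx(r, n):
    """First-order approximation r [1 + (1 - r^2) / (2 (n - 1))]."""
    return r * (1.0 + (1.0 - r * r) / (2 * (n - 1)))


if __name__ == "__main__":
    for r in (0.2, 0.5, 0.9):
        print(r, G(r, 20), G_approx(r, 20))
```

For $|r| < 1$ the correction factor exceeds 1, so $G(r)$ moves $r$ away from zero, which is the direction needed to offset the bias shown in (2.19).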

These results extend easily to the general multivariate case. Let us change
notation and denote by $(X_{1\nu}, \ldots, X_{p\nu})$, $\nu = 1, \ldots, n$, a sample from a nonsingular
$p$-variate normal distribution with means $E(X_{i\nu}) = \xi_i$ and covariances
$\mathrm{cov}(X_{i\nu}, X_{j\nu}) = \sigma_{ij}$. Then, the density of the $X$'s is

$\frac{|\Theta|^{n/2}}{(2\pi)^{pn/2}} \exp\left\{-\frac{1}{2} \sum_j \sum_k \theta_{jk} \sum_\nu (x_{j\nu} - \xi_j)(x_{k\nu} - \xi_k)\right\}$

where $\Theta = (\theta_{jk})$ is the inverse of the covariance matrix $(\sigma_{jk})$. This is a full-rank
exponential family, for which the $p + \frac{1}{2}p(p + 1)$ statistics $\bar{X}_{i\cdot} = \sum_\nu X_{i\nu}/n$
$(i = 1, \ldots, p)$ and $S_{jk} = \sum_\nu (X_{j\nu} - \bar{X}_{j\cdot})(X_{k\nu} - \bar{X}_{k\cdot})$ are complete.

Since the marginal distributions of the $X_{j\nu}$ and the pair $(X_{j\nu}, X_{k\nu})$ are univariate
and bivariate normal, respectively, it follows from Example 2.1 and the earlier part
of the present example, that $\bar{X}_{i\cdot}$ is UMVU for $\xi_i$ and $S_{jk}/(n - 1)$ for $\sigma_{jk}$. Also, the
UMVU estimators of the correlation coefficients $\rho_{jk} = \sigma_{jk}/\sqrt{\sigma_{jj}\sigma_{kk}}$ are just those
obtained from the bivariate distribution of the $(X_{j\nu}, X_{k\nu})$. The UMVU estimator of
the square of the multiple correlation coefficient of one of the $p$ coordinates with
the other $p - 1$ was obtained by Olkin and Pratt (1958). The problem of estimating
a multivariate normal probability density has been treated by Ghurye and Olkin
(1969); see also Gatsonis 1984. ||

Results quite analogous to those found in Examples 2.1-2.3 obtain when the
normal density (2.3) is replaced by the exponential density

(2.22) $\frac{1}{b^n} \exp\left[-\frac{1}{b} \sum (x_i - a)\right], \quad x_i \ge a$.

Despite its name, this two-parameter family does not constitute an exponential
family since its support changes with $a$. However, for fixed $a$, it constitutes a
one-parameter exponential family with parameter $1/b$.

Example 2.5 The exponential one-sample problem. Suppose, first, that $b$ is
known. Then, $X_{(1)}$ is sufficient for $a$ and complete (Example 1.6.24). The distribution
of $n[X_{(1)} - a]/b$ is the standard exponential distribution $E(0, 1)$ and the
UMVU estimator of $a$ is $X_{(1)} - (b/n)$ (Problem 2.17). On the other hand, when
$a$ is known, the distribution (2.22) constitutes a one-parameter exponential family
with complete sufficient statistic $\sum (X_i - a)$. Since $2\sum (X_i - a)/b$ is distributed as
$\chi^2_{2n}$, it is seen that $\sum (X_i - a)/n$ is the UMVU estimator for $b$ (Problem 2.17).

When both parameters are unknown, $X_{(1)}$ and $\sum [X_i - X_{(1)}]$ are jointly sufficient
and complete (Example 1.6.27). Since they are independently distributed, $n[X_{(1)} - a]/b$
as $E(0, 1)$ and $2\sum [X_i - X_{(1)}]/b$ as $\chi^2_{2(n-1)}$ (Problem 1.6.18), it follows that
(Problem 2.18)

(2.23)    $\dfrac{1}{n-1}\sum [X_i - X_{(1)}]$   and   $X_{(1)} - \dfrac{1}{n(n-1)}\sum [X_i - X_{(1)}]$

are UMVU for $b$ and $a$, respectively.

It is also easy to obtain the UMVU estimators of $a/b$ and of the critical value
$u$ for which $P(X_1 \le u)$ has a given value $p$. If, instead, $u$ is given, the UMVU
estimator of $P(X_1 \le u)$ can be found in analogy with the normal case (Problems
2.19 and 2.20). Finally, the two-sample problems corresponding to Example 2.3(a)
and (b) can be handled very similarly to the normal case (Problems 2.21-2.23). ||
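As an illustration of (2.23), the two estimators can be computed directly from a sample. This is only a sketch; the function name `exponential_umvu` is hypothetical.

```python
def exponential_umvu(sample):
    """Return (a_hat, b_hat), the estimators of (2.23) when a and b are both unknown."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    x_min = min(sample)                        # X_(1), the smallest observation
    spread = sum(x - x_min for x in sample)    # sum of [X_i - X_(1)]
    b_hat = spread / (n - 1)
    a_hat = x_min - spread / (n * (n - 1))
    return a_hat, b_hat
```

For the sample 1, 2, 3, 4, the smallest observation is 1 and the sum of the deviations from it is 6, so the estimate of $b$ is 6/3 = 2 and the estimate of $a$ is 1 - 6/12 = 0.5.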

An important aspect of estimation theory is the comparison of different estimators.
As a competitor of UMVU estimators, we shall now consider the maximum
likelihood estimator (ML estimator, see Section 6.2). This comparison is of interest
both because of the widespread use of the ML estimator and because of its
asymptotic optimality (which will be discussed in Chapter 6). If a distribution is
specified by a parameter $\theta$ (which need not be real-valued), the ML estimator of
$\theta$ is that value $\hat\theta$ of $\theta$ which maximizes the probability or probability density. The
ML estimator of $g(\theta)$ is defined to be $g(\hat\theta)$.

Example 2.6 Comparing UMVU and ML estimators. Let $X_1, \ldots, X_n$ be iid
according to the normal distribution $N(\xi, \sigma^2)$. Then, the joint density of the $X$'s is
given by (2.3) and it is easily seen that the ML estimators of $\xi$ and $\sigma^2$ are (Problem
2.26)

(2.24)    $\hat\xi = \bar X$   and   $\hat\sigma^2 = \dfrac{1}{n}\sum (X_i - \bar X)^2.$

Within the framework of this example, one can illustrate the different possible
relationships between UMVU and ML estimators.

(a) When the estimand $g(\xi, \sigma)$ is $\xi$, then $\bar X$ is both the ML estimator and the
UMVU estimator, so in this case, the two estimators coincide.

(b) Let $\sigma$ be known, say $\sigma = 1$, and let $g(\xi, \sigma)$ be the probability $p = \Phi(u - \xi)$
considered in Example 2.2 (see also Example 3.1.13). The UMVU estimator
is $\Phi[\sqrt{n/(n-1)}\,(u - \bar X)]$, whereas the ML estimator is $\Phi(u - \bar X)$. Since the
ML estimator is biased (by completeness, there can be only one unbiased
function of $\bar X$), the comparison should be based on the mean squared error
(rather than the variance)

(2.25)    $R_\delta(\xi, \sigma) = E[\delta - g(\xi, \sigma)]^2$

as risk. Such a comparison was carried out by Zacks and Even (1966), who
found that neither estimator is uniformly better than the other. For $n = 4$, for
example, the UMVU estimator is better when $|u - \xi| > 1.3$ or, equivalently,
when $p < .1$ or $p > .9$, whereas for the remaining values the ML estimator
has smaller mean squared error.

This example raises the question whether there are situations in which the 
ML estimator is either uniformly better or worse than its UMVU competitor. 
The following two simple examples illustrate these possibilities. 

(c) If $\xi$ and $\sigma^2$ are both unknown, the UMVU estimator and the ML estimator of
$\sigma^2$ are, respectively, $S^2/(n-1)$ and $S^2/n$, where $S^2 = \sum (X_i - \bar X)^2$. Consider
the general class of estimators $cS^2$. An easy calculation (Problem 2.28) shows
that

(2.26)    $E(cS^2 - \sigma^2)^2 = \sigma^4\left[(n^2 - 1)c^2 - 2(n-1)c + 1\right].$

For any given $c$, this risk function is proportional to $\sigma^4$. The risk functions
corresponding to different values of $c$, therefore, do not intersect, but one lies
entirely above the other. The right side of (2.26) is minimized by $c = 1/(n+1)$.
Since the values $c = 1/(n-1)$ and $c = 1/n$, corresponding to the UMVU and
ML estimator, respectively, lie on the same side of $1/(n+1)$ with $1/n$ being
closer and the risk function is quadratic, it follows that the ML estimator has
uniformly smaller risk than the UMVU estimator, but that the ML estimator,
in turn, is dominated by $S^2/(n+1)$. (For further discussion of this problem,
see Section 3.3.)

(d) Suppose that $\sigma^2$ is known and let the estimand be $\xi^2$. Then, the ML estimator
is $\bar X^2$ and the UMVU estimator is $\bar X^2 - \sigma^2/n$ (Problem 2.1). That the risk of
the ML estimator is uniformly larger follows from the following lemma.


Lemma 2.7 Let the risk be expected squared error. If $\delta$ is an unbiased estimator
of $g(\theta)$ and if $\delta^* = \delta + b$, where the bias $b$ is independent of $\theta$, then $\delta^*$ has uniformly
larger risk than $\delta$, in fact,

$$R_{\delta^*}(\theta) = R_\delta(\theta) + b^2.$$
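The risk comparison in part (c) of Example 2.6 can be reproduced with exact arithmetic. The sketch below evaluates the bracket in (2.26), that is, the risk of $cS^2$ divided by $\sigma^4$, at the three multipliers discussed there; the names `scaled_risk` and `compare` are illustrative only.

```python
from fractions import Fraction


def scaled_risk(c, n):
    """Risk of c*S^2 divided by sigma^4, i.e. the bracket in (2.26)."""
    c = Fraction(c)
    return (n * n - 1) * c * c - 2 * (n - 1) * c + 1


def compare(n):
    """Scaled risks for the UMVU, ML, and risk-minimizing multiples of S^2."""
    return {
        "umvu": scaled_risk(Fraction(1, n - 1), n),   # c = 1/(n-1)
        "ml": scaled_risk(Fraction(1, n), n),         # c = 1/n
        "best": scaled_risk(Fraction(1, n + 1), n),   # c = 1/(n+1)
    }
```

Substituting into (2.26) gives the closed forms $2/(n-1)$, $(2n-1)/n^2$, and $2/(n+1)$, respectively; for $n = 10$ these are 2/9, 19/100, and 2/11, in decreasing order as the text asserts.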

For small sample sizes, both the UMVU and ML estimators can be unsatisfactory.
One unpleasant possible feature of UMVU estimators is illustrated by the
estimation of $\xi^2$ in the normal case [Problem 2.5; Example 2.6(d)]. The UMVU
estimator is $\bar X^2 - \sigma^2/n$ when $\sigma$ is known, and $\bar X^2 - S^2/[n(n-1)]$ when it is unknown.
In either case, the estimator can take on negative values although the estimand is
known to be non-negative. Except when $\xi = 0$ or $n$ is small, the probability of such
values is not large, but when they do occur, they cause some embarrassment. The
difficulty can be avoided, and at the same time the risk of the estimator improved,
by replacing the estimator by zero whenever it is negative. This idea is developed
further in Sections 4.7 and 5.6. It is also the case that most of these problems
disappear in large samples, as we will see in Chapter 6.

The examples of this section are fairly typical and suggest that the difference 
between the two estimators tends to be small. For samples from the exponential 
families, which constitute the main area of application of UMVU estimation, it 
has, in fact, been shown under suitable regularity assumptions that the UMVU and 
ML estimators are asymptotically equivalent as the sample size tends to infinity, 
so that the UMVU estimator shares the asymptotic optimality of the ML estimator. 
(For an exact statement and counterexamples, see Portnoy 1977b.) 

3 Discrete Distributions 

The distributions considered in the preceding section were all continuous. We shall 
now treat the corresponding problems for some of the basic discrete distributions. 

Example 3.1 Binomial UMVU estimators. In the simplest instance of a one-sample
problem with qualitative rather than quantitative “measurements,” the observations
are dichotomous: cure or no cure, satisfactory or defective, yes or no.
The two outcomes will be referred to generically as success or failure.

The results of $n$ independent such observations with common success probability
$p$ are conveniently represented by random variables $X_i$ which are 1 or 0 as the
$i$th case or “trial” is a success or failure. Then, $P(X_i = 1) = p$, and the joint
distribution of the $X$'s is given by

(3.1)    $P(X_1 = x_1, \ldots, X_n = x_n) = p^{\sum x_i} q^{\,n - \sum x_i} \qquad (q = 1 - p).$

This is a one-parameter exponential family, and $T = \sum X_i$, the total number
of successes, is a complete sufficient statistic. Since $E(X_i) = E(\bar X) = p$ and
$\bar X = T/n$, it follows that $T/n$ is the UMVU estimator of $p$. Similarly,
$\sum (X_i - \bar X)^2/(n-1) = T(n-T)/[n(n-1)]$ is the UMVU estimator of $\operatorname{var}(X_i) = pq$
(Problem 3.1; see also Example 1.13).

The distribution of $T$ is the binomial distribution $b(p, n)$, and it was pointed out
in Example 1.2 that $1/p$ is not U-estimable on the basis of $T$, and hence not in the
present situation. In fact, it follows from Equation (1.2) that a function $g(p)$ can
be U-estimable only if it is a polynomial of degree $\le n$.

To see that every such polynomial is actually U-estimable, it is enough to show
that $p^m$ is U-estimable for every $m \le n$. This can be established, and the UMVU
estimator determined, by Method 1 of Section 1 (Problem 3.2). An alternative
approach utilizes Method 2. The quantity $p^m$ is the probability

$$p^m = P(X_1 = \cdots = X_m = 1)$$

and its UMVU estimator is therefore given by

$$\delta(t) = P[X_1 = \cdots = X_m = 1 \mid T = t].$$

This probability is 0 if $t < m$. For $t \ge m$, $\delta(t)$ is the probability of obtaining $m$
successes in the first $m$ trials and $t - m$ successes in the remaining $n - m$ trials,
divided by $P(T = t)$, and hence it is

$$p^m \binom{n-m}{t-m} p^{t-m} q^{n-t} \Big/ \binom{n}{t} p^{t} q^{n-t},$$

or

(3.2)    $\delta(T) = \dfrac{T(T-1)\cdots(T-m+1)}{n(n-1)\cdots(n-m+1)}.$

Since this expression is zero when $T = 0, \ldots, m-1$, it is seen that $\delta(T)$, given by
(3.2) for all $T = 0, 1, \ldots, n$, is the UMVU estimator of $p^m$. This proves that $g(p)$
is U-estimable on the basis of $n$ binomial trials if and only if it is a polynomial of
degree $\le n$. ||
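The unbiasedness of (3.2) can be verified exactly for particular values of $n$, $m$, and $p$. The following sketch sums $\delta(t)$ against the binomial probabilities using rational arithmetic; the function names are illustrative, and $m \le n$ is assumed.

```python
from fractions import Fraction
from math import comb


def umvu_p_power(t, n, m):
    """Estimator (3.2) of p^m from t successes in n trials (assumes m <= n)."""
    num, den = 1, 1
    for j in range(m):
        num *= t - j      # contains the factor 0 whenever t < m
        den *= n - j
    return Fraction(num, den)


def expectation(n, m, p):
    """Exact E[delta(T)] when T has the binomial distribution b(p, n)."""
    p = Fraction(p)
    q = 1 - p
    return sum(umvu_p_power(t, n, m) * comb(n, t) * p**t * q**(n - t)
               for t in range(n + 1))
```

With $n = 5$, $m = 2$, and $p = 1/3$, `expectation(5, 2, Fraction(1, 3))` equals $1/9 = p^2$.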


Consider now the estimation of $1/p$, for which no unbiased estimator exists.
This problem arises, for example, when estimating the size of certain animal
populations. Suppose that a lake contains an unknown number $N$ of some species of
fish. A random sample of size $k$ is caught, tagged, and released again. Somewhat
later, a random sample of size $n$ is obtained and the number $X$ of tagged fish in the
sample is noted. (This is the capture-recapture method. See, for example, George
and Robert, 1992.) If, for the sake of simplicity, we assume that each caught fish is
immediately returned to the lake (or alternatively that $N$ is very large compared to
$n$), the $n$ fish in this sample constitute $n$ binomial trials with probability $p = k/N$
of success (i.e., obtaining a tagged fish). The population size $N$ is therefore equal
to $k/p$. We shall now discuss a sampling scheme under which $1/p$, and hence $k/p$,
is U-estimable.

Example 3.2 Inverse binomial sampling. Reliable estimation of $1/p$ is clearly
difficult when $p$ is close to zero, where a small change of $p$ will cause a large change
in $1/p$. To obtain control of $1/p$ for all $p$, it would therefore seem necessary to take
more observations the smaller $p$ is. A sampling scheme achieving this is inverse
sampling, which continues until a specified number of successes, say $m$, have been
obtained. Let $Y + m$ denote the required number of trials. Then, $Y$ has the negative
binomial distribution given by (Problem 1.5.12)

(3.3)    $P(Y = y) = \dbinom{m+y-1}{m-1} p^m (1-p)^y, \qquad y = 0, 1, \ldots,$

with

(3.4)    $E(Y) = m(1-p)/p; \qquad \operatorname{var}(Y) = m(1-p)/p^2.$

It is seen from (3.4) that

$$\delta(Y) = (Y + m)/m,$$

the reciprocal of the proportion of successes, is an unbiased estimator of $1/p$.

The full data in the present situation consist not only of $Y$ but also of the positions in
which the $m$ successes occur. However, $Y$ is a sufficient statistic (Problem 3.6),
and it is complete since (3.3) is an exponential family. As a function of $Y$, $\delta(Y)$ is
thus the unique unbiased estimator of $1/p$; based on the full data, it is UMVU.

It is interesting to note that $1/(1-p)$ is not U-estimable with the present
sampling scheme, for suppose $\delta(Y)$ is an unbiased estimator so that

$$p^m \sum_{y=0}^{\infty} \delta(y) \binom{m+y-1}{m-1} (1-p)^y = 1/(1-p) \qquad \text{for all } 0 < p < 1.$$

The left side is a power series in $1 - p$ which converges for all $0 < p < 1$, and hence
converges and is continuous for all $|1 - p| < 1$. As $p \to 1$, the left side therefore
tends to $\delta(0)$ while the right side tends to infinity. Thus, the assumed $\delta$ does not
exist. (For the estimation of $p^r$, see Problem 3.4.)
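The unbiasedness of $\delta(Y) = (Y + m)/m$ for $1/p$ under (3.3) can be checked numerically. This is a rough sketch, assuming that truncating the infinite sum at a large `y_max` leaves a negligible tail (true unless $p$ is extremely small); the function names are hypothetical.

```python
from math import comb


def neg_binomial_pmf(y, m, p):
    """P(Y = y) from (3.3): y failures observed before the m-th success."""
    return comb(m + y - 1, m - 1) * p**m * (1 - p)**y


def mean_of_estimator(m, p, y_max=2000):
    """Truncated sum approximating E[(Y + m)/m]; terms beyond y_max are dropped."""
    return sum((y + m) / m * neg_binomial_pmf(y, m, p) for y in range(y_max + 1))
```

For $m = 3$ and $p = 0.2$, the truncated sum agrees with $1/p = 5$ to within floating-point error.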

The situations described in Examples 3.1 and 3.2 are special cases of sequential 
binomial sampling in which the number of trials is allowed to depend on the 
observations. The outcome of such sampling can be represented as a random walk 
in the plane. The walk starts at (0, 0) and moves a unit to the right or up as the first 
trial is a success or failure. From the resulting point (1,0) or (0, 1), it again moves 
a unit to the right or up, and continues in this way until the sampling plan tells it 
to stop. A stopping rule is thus defined by a set B of points, a boundary, at which 
sampling stops. We require B to satisfy 

(3.5)    $\displaystyle\sum_{(x,y) \in B} P(x, y) = 1$

since otherwise there is positive probability that sampling will go on indefinitely. 
A stopping rule that satisfies (3.5) is called closed. 

Any particular sample path ending in $(x, y)$ has probability $p^x q^y$, and the
probability of a path ending in any particular point $(x, y)$ is therefore

(3.6)    $P(x, y) = N(x, y)\,p^x q^y,$

where $N(x, y)$ denotes the number of paths along which the random walk can
reach the point $(x, y)$. As illustrations, consider the plans of Examples 3.1 and 3.2.

(a) In Example 3.1, $B$ is the set of points $(x, y)$ satisfying $x + y = n$, $x = 0, \ldots, n$,
and for any $(x, y) \in B$, we have $N(x, y) = \binom{n}{x}$.

(b) In Example 3.2, $B$ is the set of points $(x, y)$ with $x = m$; $y = 0, 1, \ldots$, and
for any such point $N(x, y) = \binom{m+y-1}{m-1}$.

The observations in sequential binomial sampling are represented by the sample
path, and it follows from (3.6) and the factorization criterion that the coordinates
$(X, Y)$ of the stopping point in which the path terminates constitute a sufficient
statistic. This can also be seen from the definition of sufficiency, since the
conditional probability of any given sample path given that it ends in $(x, y)$ is

$$\frac{p^x q^y}{N(x, y)\,p^x q^y} = \frac{1}{N(x, y)},$$



which is independent of $p$.

Example 3.3 Sequential estimation of binomial p. For any closed sequential
binomial sampling scheme, an unbiased estimator of $p$ depending only on the
sufficient statistic $(X, Y)$ can be found in the following way. A simple unbiased
estimator is $\delta = 1$ if the first trial is a success and $\delta = 0$ otherwise. Application of
the Rao-Blackwell theorem then leads to

$$\delta'(X, Y) = E[\delta \mid (X, Y)] = P[\text{1st trial} = \text{success} \mid (X, Y)]$$

as an unbiased estimator depending only on $(X, Y)$. If the point $(1, 0)$ is a stopping
point, then $\delta' = \delta$ and nothing is gained. In all other cases, $\delta'$ will have a smaller
variance than $\delta$. An easy calculation [Problem 3.8(a)] shows that

(3.7)    $\delta'(x, y) = N'(x, y)/N(x, y)$

where $N'(x, y)$ is the number of paths possible under the sampling scheme which
pass through $(1, 0)$ and terminate in $(x, y)$. ||
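The path counts $N(x, y)$ and $N'(x, y)$ in (3.7) can be obtained by stepping through the lattice level by level. The sketch below does this for a finite plan and checks unbiasedness of $\delta'$ exactly. It assumes that $(0, 0)$ is not itself a stopping point; the function names, and the use of a curtailed plan (stop once $X = a$, $Y = b$, or $X + Y = n$) as the illustration, are choices made here rather than taken from the example.

```python
from fractions import Fraction


def boundary_counts(start, is_stop):
    """Count paths from `start` to each stopping point of a finite plan.

    A path moves one unit right (success) or up (failure) and ends at the
    first point for which is_stop(x, y) is true.
    """
    frontier = {start: 1}
    reached = {}
    while frontier:
        nxt = {}
        for (x, y), ways in frontier.items():
            if is_stop(x, y):
                reached[(x, y)] = reached.get((x, y), 0) + ways
            else:
                for pt in ((x + 1, y), (x, y + 1)):
                    nxt[pt] = nxt.get(pt, 0) + ways
        frontier = nxt
    return reached


def curtailed(a, b, n):
    """Stopping rule: stop once X = a, Y = b, or X + Y = n."""
    return lambda x, y: x >= a or y >= b or x + y >= n


def rao_blackwell_p(is_stop):
    """Return (delta', N) with delta'(x, y) = N'(x, y) / N(x, y) as in (3.7)."""
    N = boundary_counts((0, 0), is_stop)
    N1 = boundary_counts((1, 0), is_stop)   # paths whose first step is a success
    delta = {pt: Fraction(N1.get(pt, 0), N[pt]) for pt in N}
    return delta, N


def expected_value(delta, N, p):
    """Exact E[delta'(X, Y)] using P(x, y) = N(x, y) p^x q^y from (3.6)."""
    p = Fraction(p)
    q = 1 - p
    return sum(delta[(x, y)] * N[(x, y)] * p**x * q**y for (x, y) in N)
```

For `curtailed(2, 2, 3)` the boundary points are $(2, 0)$, $(0, 2)$, $(2, 1)$, and $(1, 2)$ with path counts 1, 1, 2, 2, and $\delta'$ takes the values 1, 0, 1/2, 1/2 there; its expectation $p^2 + p^2 q + p q^2$ reduces to $p$.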

More generally, if $(a, b)$ is any accessible point, that is, if it is possible under
the given sampling plan to reach $(a, b)$, the quantity $p^a q^b$ is U-estimable, and
an unbiased estimator depending only on $(X, Y)$ is given by (3.7), where $N'(x, y)$
now stands for the number of paths passing through $(a, b)$ and terminating in $(x, y)$
[Problem 3.8(b)].

The estimator (3.7) will be UMVU for any sampling plan for which the sufficient
statistic $(X, Y)$ is complete. To describe conditions under which this is the case, let
us call an accessible point that is not in $B$ a continuation point. A sampling plan
is called simple if the set of continuation points $C_t$ on each line segment $x + y = t$
is an interval or the empty set. A plan is called finite if the number of accessible
points is finite.

Example 3.4 Two sampling plans. 

(a) Let $a$, $b$, and $m$ be three positive integers with $a < b < m$. Continue observation
until either $a$ successes or $a$ failures have been obtained. If this does not
happen during the first $m$ trials, continue until either $b$ successes or $b$ failures
have been obtained. This sampling plan is simple and finite.

(b) Continue until at least $a$ successes and at least $a$ failures have been obtained.
This plan is neither simple nor finite, but it is closed (Problem 3.10). ||

Theorem 3.5 A necessary and sufficient condition for a finite sampling plan to be 
complete is that it is simple. 

We shall here only prove sufficiency. [For a proof of necessity, see Girshick,
Mosteller, and Savage 1946.] If the restriction to finite plans is dropped, simplicity 
is no longer sufficient (Problem 3.9). Another necessary condition in that case is 
stated in Problem 3.13. This condition, together with simplicity, is also sufficient. 
(For a proof, see Lehmann and Stein 1950.) 

For the following proof it may be helpful to consider a diagram of plan (a) of 
Example 3.4. 

Proof of sufficiency. Suppose there exists a nonzero function $\delta(X, Y)$ whose
expectation is zero for all $p$ $(0 < p < 1)$. Let $t_0$ be the smallest value of $t$ for
which there exists a boundary point $(x_0, y_0)$ on $x + y = t_0$ such that $\delta(x_0, y_0) \ne 0$.
Since the continuation points on $x + y = t_0$ (if any) form an interval, they all lie on
the same side of $(x_0, y_0)$. Suppose, without loss of generality, that $(x_0, y_0)$ lies to
the left and above $C_{t_0}$, and let $(x_1, y_1)$ be that boundary point on $x + y = t_0$ above
$C_{t_0}$ and with $\delta(x, y) \ne 0$, which has the smallest $x$-coordinate. Then, all boundary
points with $\delta(x, y) \ne 0$ satisfy $t \ge t_0$ and $x \ge x_1$. It follows that for all $0 < p < 1$

$$E[\delta(X, Y)] = N(x_1, y_1)\,\delta(x_1, y_1)\,p^{x_1} q^{\,t_0 - x_1} + p^{x_1 + 1} R(p) = 0$$

where $R(p)$ is a polynomial in $p$. Dividing by $p^{x_1}$ and letting $p \to 0$, we see that
$\delta(x_1, y_1) = 0$, which is a contradiction. □


Fixed binomial sampling satisfies the conditions of the theorem, but, there (and 
for inverse binomial sampling), completeness follows already from the fact that it 
leads to a full-rank exponential family (5.1) with s = 1. An example in which this is 
not the case is curtailed binomial sampling, in which sampling is continued as long 
as $X < a$, $Y < b$, and $X + Y < n$ $(a, b < n)$ and is stopped as soon as one of the
three boundaries is reached (Problem 3.11). Double sampling and curtailed double 
sampling provide further applications of the theory. (See Girshick, Mosteller, and 
Savage 1946; see also Kremers 1986.) 

The discrete distributions considered so far were all generated by binomial trials. 
A large class of examples is obtained by considering one-parameter exponential 
families (5.2) in which T(x) is integer-valued. Without loss of generality, we shall 
take T (x) to be x and the distribution of X to be given by 

(3.8)    $P(X = x) = e^{\eta x - B(\eta)}\,a(x).$

Putting $\theta = e^{\eta}$, we can write (3.8) as

(3.9)    $P(X = x) = a(x)\,\theta^x / C(\theta), \qquad x = 0, 1, \ldots, \quad \theta > 0.$

For any function $a(x)$ for which $\sum a(x)\theta^x < \infty$ for some $\theta > 0$, this is a family
of power series distributions (Problems 1.5.14-1.5.16). The binomial distribution
$b(p, n)$ is obtained from (3.9) by putting $a(x) = \binom{n}{x}$ for $x = 0, 1, \ldots, n$, and
$a(x) = 0$ otherwise; $\theta = p/q$ and $C(\theta) = (\theta + 1)^n$. The negative binomial
distribution with $a(x) = \binom{m+x-1}{m-1}$, $\theta = q$, and $C(\theta) = (1 - \theta)^{-m}$ is another
example. The family (3.9) is clearly complete. If $a(x) > 0$ for all $x = 0, 1, \ldots$,
then $\theta^r$ is U-estimable for any positive integer $r$, and its unique unbiased estimator
is obtained by solving the equations

$$\sum_{x=0}^{\infty} \delta(x)\,a(x)\,\theta^x = \theta^r \cdot C(\theta) \qquad \text{for all } \theta \in \Omega.$$

Since $\sum a(x)\theta^x = C(\theta)$, comparison of the coefficients of $\theta^x$ yields

(3.10)    $\delta(x) = \begin{cases} 0 & \text{if } x = 0, \ldots, r-1 \\ a(x-r)/a(x) & \text{if } x \ge r. \end{cases}$


Suppose, next, that $X_1, \ldots, X_n$ are iid according to a power series family (3.9).
Then, $X_1 + \cdots + X_n$ is sufficient for $\theta$, and its distribution is given by the following
lemma.

Lemma 3.6 The distribution of $T = X_1 + \cdots + X_n$ is the power series family

(3.11)    $P(T = t) = \dfrac{A(t, n)\,\theta^t}{[C(\theta)]^n},$

where $A(t, n)$ is the coefficient of $\theta^t$ in the power series expansion of $[C(\theta)]^n$.

Proof. By definition,

$$P(T = t) = \theta^t \sum\nolimits_t \frac{a(x_1) \cdots a(x_n)}{[C(\theta)]^n}$$

where $\sum_t$ indicates that the summation extends over all $n$-tuples of integers
$(x_1, \ldots, x_n)$ with $x_1 + \cdots + x_n = t$. If

(3.12)    $B(t, n) = \sum\nolimits_t a(x_1) \cdots a(x_n),$

the distribution of $T$ is given by (3.11) with $B(t, n)$ in place of $A(t, n)$. On the
other hand,

$$[C(\theta)]^n = \left[\sum_{x=0}^{\infty} a(x)\,\theta^x\right]^n,$$

and for any $t = 0, 1, \ldots$, the coefficient of $\theta^t$ in the expansion of the right side as
a power series in $\theta$ is just $B(t, n)$. Thus, $B(t, n) = A(t, n)$, and this completes the
proof. □


It follows from the lemma that $T$ is complete and from (3.10) that the UMVU
estimator of $\theta^r$ on the basis of a sample of $n$ is

(3.13)    $\delta(t) = \begin{cases} 0 & \text{if } t = 0, \ldots, r-1 \\ \dfrac{A(t-r, n)}{A(t, n)} & \text{if } t \ge r. \end{cases}$
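The coefficients $A(t, n)$ of Lemma 3.6 can be computed by repeated multiplication of truncated power series, after which (3.13) is a simple ratio. The sketch below assumes the caller supplies the coefficients $a(0), \ldots, a(t)$ as exact rationals with $a(x) > 0$; the function names are illustrative.

```python
from fractions import Fraction


def power_coefficients(a, n, t_max):
    """Coefficients A(0, n), ..., A(t_max, n) of theta^t in [C(theta)]^n.

    `a` lists a(0), ..., a(t_max); higher-order terms of C(theta) cannot
    affect these coefficients, so truncation is exact.
    """
    coeffs = [Fraction(1)] + [Fraction(0)] * t_max   # the constant series 1
    for _ in range(n):
        # Multiply the current series by C(theta), keeping terms up to t_max.
        coeffs = [sum(coeffs[s] * a[t - s] for s in range(t + 1))
                  for t in range(t_max + 1)]
    return coeffs


def umvu_theta_power(t, r, n, a):
    """Estimator (3.13) of theta^r; `a` must list at least a(0), ..., a(t)."""
    if t < r:
        return Fraction(0)
    A = power_coefficients(a, n, t)
    return A[t - r] / A[t]
```

With the Poisson coefficients $a(x) = 1/x!$, for which $[C(\theta)]^n = e^{n\theta}$, the computed $A(4, 3)$ is $3^4/4! = 27/8$, and the estimator of $\theta^2$ at $t = 5$, $n = 3$ is $5 \cdot 4/3^2 = 20/9$.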


Consider, next, the problem of estimating the probability distribution of $X$ from
a sample $X_1, \ldots, X_n$. The estimand can be written as

$$g(\theta) = P_\theta(X_1 = x)$$

and the UMVU estimator is therefore given by

$$\delta(t) = P[X_1 = x \mid X_1 + \cdots + X_n = t] = \frac{P(X_1 = x)\,P(X_2 + \cdots + X_n = t - x)}{P(T = t)}.$$

In the present case, this reduces to

(3.14)    $\delta(t) = \dfrac{a(x)\,A(t - x, n - 1)}{A(t, n)}, \qquad n > 1, \quad 0 \le x \le t.$


Example 3.7 Poisson UMVU estimation. The Poisson distribution, shown in
Table 1.5.1, arises as a limiting case of the binomial distribution for large $n$ and
small $p$, and more generally as the number of events occurring in a fixed time
period when the events are generated by a Poisson process. The distribution $P(\theta)$
of a Poisson variable with expectation $\theta$ is given by (3.9) with

(3.15)    $a(x) = \dfrac{1}{x!}, \qquad C(\theta) = e^{\theta}.$

Thus, $[C(\theta)]^n = e^{n\theta}$ and

(3.16)    $A(t, n) = \dfrac{n^t}{t!}.$

The UMVU estimator of $\theta^r$ is therefore, by (3.13), equal to

(3.17)    $\delta(t) = \dfrac{t(t-1)\cdots(t-r+1)}{n^r}$

for all $t \ge r$. Since the right side is zero for $t = 0, \ldots, r-1$, formula (3.17) holds
for all $t$.

The UMVU estimator of $P_\theta(X = x)$ is given by (3.14), which, by (3.16), becomes

$$\delta(t) = \binom{t}{x} \left(\frac{1}{n}\right)^{x} \left(\frac{n-1}{n}\right)^{t-x}, \qquad x = 0, 1, \ldots, t.$$

For varying $x$, this is the binomial distribution $b(1/n, t)$.
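The statement that the $b(1/n, t)$ probabilities are unbiased for the Poisson probabilities can be checked numerically. The sketch below assumes that truncating the sum over $t$ at `t_max` is harmless for moderate $n\theta$; the function names are hypothetical.

```python
from math import comb, exp


def umvu_poisson_pmf(x, t, n):
    """UMVU estimate of P(X = x) given T = t: the b(1/n, t) probability of x."""
    if x > t:
        return 0.0
    return comb(t, x) * (1 / n) ** x * (1 - 1 / n) ** (t - x)


def expected_estimate(x, n, theta, t_max=200):
    """Truncated sum of E[delta(T)], where T is Poisson with mean n * theta."""
    lam = n * theta
    total, pmf = 0.0, exp(-lam)          # pmf holds P(T = t), starting at t = 0
    for t in range(t_max + 1):
        total += umvu_poisson_pmf(x, t, n) * pmf
        pmf *= lam / (t + 1)             # advance to P(T = t + 1)
    return total
```

For instance, with $n = 4$, $\theta = 0.7$, and $x = 2$, the truncated expectation agrees with $e^{-0.7}(0.7)^2/2!$ up to rounding error.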

In some situations, Poisson variables are observed only when they are positive. 
For example, suppose that we have a sample from a truncated Poisson distribution 
(truncated on the left at 0) with probability function 

(3.18)    $P(X = x) = \dfrac{1}{e^{\theta} - 1}\,\dfrac{\theta^x}{x!}, \qquad x = 1, 2, \ldots.$

This is a power series distribution with

$$a(x) = \frac{1}{x!} \ \text{ if } x \ge 1, \qquad a(0) = 0,$$

and

$$C(\theta) = e^{\theta} - 1.$$

For any values of $t$ and $n$, the UMVU estimator $\delta(t)$ of $\theta$, for example, can now
be obtained from (3.13). (See Problems 3.18-3.22; for further discussion, see Tate
and Goen 1958.) ||


We next consider some multiparameter situations. 

Example 3.8 Multinomial UMVU estimation. Let $(X_0, X_1, \ldots, X_s)$ have the
multinomial distribution (5.4). As was seen in Example 1.5.3, this is an $s$-parameter
exponential family, with $(X_1, \ldots, X_s)$ or $(X_0, X_1, \ldots, X_s)$ constituting a complete
sufficient statistic. [Recall that $X_0 = n - (X_1 + \cdots + X_s)$.] Since $E(X_i) = np_i$, it
follows that $X_i/n$ is the UMVU estimator of $p_i$. To obtain the UMVU estimator of
$p_i p_j$ with $i \ne j$, note that one unbiased estimator is $\delta = 1$ if the first trial results in outcome
$i$ and the second trial in outcome $j$, and $\delta = 0$ otherwise. The UMVU estimator of
$p_i p_j$ is therefore

$$E(\delta \mid X_0, \ldots, X_s) = \frac{(n-2)!\,X_i X_j}{X_0! \cdots X_s!} \Big/ \frac{n!}{X_0! \cdots X_s!} = \frac{X_i X_j}{n(n-1)}.$$

Table 3.1. I x J Contingency Table

             $B_1$      ...    $B_J$      Total
    $A_1$    $n_{11}$   ...    $n_{1J}$   $n_{1+}$
    ...      ...               ...        ...
    $A_I$    $n_{I1}$   ...    $n_{IJ}$   $n_{I+}$
    Total    $n_{+1}$   ...    $n_{+J}$   $n$


In the application of multinomial models, the probabilities $p_0, \ldots, p_s$ are frequently
subject to additional restrictions, so that the number of independent parameters
is less than $s$. In general, such a restricted family will not constitute a full-rank
exponential family, but may be a curved exponential family. There are, however,
important exceptions. Simple examples are provided by certain contingency tables.


Example 3.9 Two-way contingency tables. A number $n$ of subjects is drawn at
random from a population sufficiently large that the drawings can be considered
to be independent. Each subject is classified according to two characteristics: $A$,
with possible outcomes $A_1, \ldots, A_I$, and $B$, with possible outcomes $B_1, \ldots, B_J$.
[For example, students might be classified as being male or female ($I = 2$) and
according to their average performance (A, B, C, D, or F; $J = 5$).] The probability
that a subject has properties $(A_i, B_j)$ will be denoted by $p_{ij}$ and the number of
such subjects in the sample by $n_{ij}$. The joint distribution of the $IJ$ variables $n_{ij}$
is an unrestricted multinomial distribution with $s = IJ - 1$, and the results of the
sample can be represented in an $I \times J$ table, such as Table 3.1. From Example 3.8,
it follows that the UMVU estimator of $p_{ij}$ is $n_{ij}/n$.

A special case of Table 3.1 arises when $A$ and $B$ are independent, that is, when
$p_{ij} = p_{i+}p_{+j}$ where $p_{i+} = p_{i1} + \cdots + p_{iJ}$ and $p_{+j} = p_{1j} + \cdots + p_{Ij}$. The joint
probability of the $IJ$ cell counts then reduces to

$$\frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i} p_{i+}^{\,n_{i+}} \prod_{j} p_{+j}^{\,n_{+j}}.$$

This is an $(I + J - 2)$-parameter exponential family with the complete sufficient
statistics $(n_{i+}, n_{+j})$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, or, equivalently, $i = 1, \ldots, I - 1$,
$j = 1, \ldots, J - 1$. In fact, $(n_{1+}, \ldots, n_{I+})$ and $(n_{+1}, \ldots, n_{+J})$ are independent,
with multinomial distributions $M(p_{1+}, \ldots, p_{I+}; n)$ and $M(p_{+1}, \ldots, p_{+J}; n)$,
respectively (Problem 3.27), and the UMVU estimators of $p_{i+}$, $p_{+j}$ and $p_{ij} = p_{i+}p_{+j}$
are, therefore, $n_{i+}/n$, $n_{+j}/n$ and $n_{i+}n_{+j}/n^2$, respectively. ||


When studying the relationship between two characteristics A and B, one may 
find A and B to be dependent although no mechanism appears to exist through 
which either factor could influence the other. An explanation is sometimes found 
in the dependence of both factors on a common third factor, C, a phenomenon 
known as spurious correlation. The following example describes a model for this 
situation. 




Example 3.10 Conditional independence in a three-way table. In the situation
of Example 3.9, suppose that each subject is also classified according to a third
factor $C$ as $C_1, \ldots,$ or $C_K$. [The third factor for the students of Example 3.9
might be their major (History, Physics, etc.).] Consider this situation under the
assumption that conditionally given $C_k$ $(k = 1, \ldots, K)$, the characteristics $A$ and
$B$ are independent, so that

(3.19)    $p_{ijk} = p_{++k}\,p_{i+|k}\,p_{+j|k}$

where $p_{i+|k}$, $p_{+j|k}$, and $p_{ij|k}$ denote the probability of the subject having properties
$A_i$, $B_j$, or $(A_i, B_j)$, respectively, given that it has property $C_k$.

After some simplification, the joint probability of the $IJK$ cell counts $n_{ijk}$ is
seen to be proportional to (Problem 3.28)

(3.20)    $\displaystyle\prod_{i,j,k} \left(p_{++k}\,p_{i+|k}\,p_{+j|k}\right)^{n_{ijk}} = \prod_{k} \left[ p_{++k}^{\,n_{++k}} \prod_{i} p_{i+|k}^{\,n_{i+k}} \prod_{j} p_{+j|k}^{\,n_{+jk}} \right].$

This is an exponential family of dimension

$$(K - 1) + K(I + J - 2) = K(I + J - 1) - 1$$

with complete sufficient statistics $T = \{(n_{++k}, n_{i+k}, n_{+jk}),\ i = 1, \ldots, I,\ j = 1, \ldots, J,\ k = 1, \ldots, K\}$.
Since the expectation of any cell count is $n$ times the
probability of that cell, the UMVU estimators of $p_{++k}$, $p_{i+k}$, and $p_{+jk}$ are $n_{++k}/n$,
$n_{i+k}/n$, and $n_{+jk}/n$, respectively. ||

Consider, now, the estimation of the probability $p_{ijk}$. The unbiased estimator
$\delta_0 = n_{ijk}/n$, which is UMVU in the unrestricted model, is not a function of
$T$ and hence no longer UMVU. The relationship (3.19) suggests the estimator
$\delta_1 = (n_{++k}/n) \cdot (n_{i+k}/n_{++k}) \cdot (n_{+jk}/n_{++k})$, which is a function of $T$. It is easy to
see (Problem 3.30) that $\delta_1$ is unbiased and hence is UMVU. (For additional results
concerning the estimation of the parameters of this model, see Cohen 1981, or
Davis 1989.)

To conclude this section, an example is provided in which the UMVU estimator 
fails completely. 

Example 3.11 Misbehaved UMVU estimator. Let $X$ have the Poisson distribution
$P(\theta)$ and let $g(\theta) = e^{-a\theta}$, where $a$ is a known constant. The condition of
unbiasedness of an estimator $\delta$ leads to

$$\sum_{x=0}^{\infty} \frac{\delta(x)\,\theta^x}{x!} = e^{(1-a)\theta} = \sum_{x=0}^{\infty} \frac{(1-a)^x\,\theta^x}{x!}$$

and hence to

(3.21)    $\delta(X) = (1 - a)^X.$

Suppose $a = 3$. Then, $g(\theta) = e^{-3\theta}$, and one would expect an estimator which
decreases from 1 to 0 as $X$ goes from 0 to infinity. The ML estimator $e^{-3X}$ meets
this expectation. On the other hand, the unique unbiased estimator $\delta(x) = (-2)^x$
oscillates wildly between positive and negative values and appears to bear no
relation to the problem at hand. (A possible explanation for this erratic behavior is
suggested in Lehmann (1983).) It is interesting to see that the difficulty disappears
if the sample size is increased. If $X_1, \ldots, X_n$ are iid according to $P(\theta)$, then
$T = \sum X_i$ is a sufficient statistic and has the Poisson $P(n\theta)$ distribution. The
condition of unbiasedness now becomes

$$\sum_{t=0}^{\infty} \frac{\delta(t)\,(n\theta)^t}{t!} = e^{(n-a)\theta} = \sum_{t=0}^{\infty} \frac{(n-a)^t\,\theta^t}{t!}$$

and the UMVU estimator is

(3.22)    $\delta(T) = \left(1 - \dfrac{a}{n}\right)^{T}.$

This is quite reasonable as soon as $n > a$. ||
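The contrast between (3.21) and (3.22) is easy to see numerically. In the sketch below, the function names are hypothetical and the expectation is approximated by a truncated Poisson sum, which is assumed adequate for moderate $n\theta$.

```python
from math import exp


def umvu_exp_neg_a_theta(t, a, n=1):
    """Unbiased estimator (1 - a/n)^T of exp(-a * theta); n = 1 gives (3.21)."""
    return (1 - a / n) ** t


def expected_value(a, n, theta, t_max=150):
    """Truncated sum of E[delta(T)], where T is Poisson with mean n * theta."""
    lam = n * theta
    total, pmf = 0.0, exp(-lam)          # pmf holds P(T = t), starting at t = 0
    for t in range(t_max + 1):
        total += umvu_exp_neg_a_theta(t, a, n) * pmf
        pmf *= lam / (t + 1)             # advance to P(T = t + 1)
    return total
```

With $a = 3$ and $n = 1$, the estimator takes the values 1, -2, 4, -8 at $t = 0, 1, 2, 3$, yet its expectation is still $e^{-3\theta}$; with $n = 5$ the values $(0.4)^t$ decrease from 1 toward 0, as one would hope.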


4 Nonparametric Families 

Section 2.2 was concerned with continuous parametric families of distributions 
such as the normal, uniform, or exponential distributions, and Section 2.3 with 
discrete parametric families such as the binomial and Poisson distributions. We 
now turn to nonparametric families in which no specific form is assumed for the 
distribution. 

We begin with the one-sample problem in which $X_1, \ldots, X_n$ are iid with distribution
$F \in \mathcal{F}$. About the family $\mathcal{F}$, we shall make only rather general assumptions,
for example, that it is the family of distributions $F$ which have a density, or are
continuous, or have first moments, and so on. The estimand $g(F)$ might, for example,
be $E(X_i) = \int x\,dF(x)$, or $\operatorname{var} X_i$, or $P(X_i \le a) = F(a)$.

It was seen in Problem 1.6.33 that for the family $\mathcal{F}_0$ of all probability densities,
the order statistics $X_{(1)} < \cdots < X_{(n)}$ constitute a complete sufficient statistic, and
the hint given there shows that this result remains valid if $\mathcal{F}_0$ is further restricted by
requiring the existence of some moments.$^2$ (For an alternative proof, see TSH2,
Section 4.3. Also, Bell, Blackwell, and Breiman (1960) show the result is valid for
the family of all continuous distributions.)

An estimator $\delta(X_1, \ldots, X_n)$ is a function of the order statistics if and only if
it is symmetric in its $n$ arguments. For families $\mathcal{F}$ for which the order statistics
are complete, there can therefore exist at most one symmetric unbiased estimator
of any estimand, and this is UMVU. Thus, to find the UMVU estimator of any
U-estimable $g(F)$, it suffices to find a symmetric unbiased estimator.

Example 4.1 Estimating the distribution function. Let $g(F) = P(X \le a) = F(a)$,
$a$ known. The natural estimator is the number of $X$'s which are $\le a$, divided
by $n$. The number of such $X$'s is the outcome of $n$ binomial trials with
success probability $F(a)$, so that this estimator is unbiased for $F(a)$. Since it is
also symmetric, it is the UMVU estimator. This can be paraphrased by saying
that the empirical cumulative distribution function is the UMVU estimator of the
unknown true cumulative distribution function.
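The estimator of Example 4.1 amounts to a one-line computation. A minimal sketch (the function name `empirical_cdf` is illustrative):

```python
def empirical_cdf(sample, a):
    """Fraction of observations that are <= a, the estimator of F(a) in Example 4.1."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    return sum(1 for x in sample if x <= a) / n
```

The result depends on the sample only through the order statistics: reordering the observations leaves it unchanged, which is the symmetry used in the argument above.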

Note. In the normal case of Section 2.2, it was possible to find unbiased estimators
not only of $P(X \le u)$ but also of the probability density $p_X(u)$ of $X$. No unbiased
estimator of the density exists for the family $\mathcal{F}_0$. For proofs, see Rosenblatt 1956,
and Bickel and Lehmann 1969, and for further discussion of the problem of
estimating a nonparametric density see Rosenblatt 1971, the books by Devroye and
Gyoerfi (1985), Silverman (1986), or Wand and Jones (1995), and the review
article by Izenman (1991). Nonparametric density estimation is an example of what
Liu and Brown (1993) call singular problems, which pose problems for unbiased
estimation. See Note 8.3. ||

2 The corresponding problem in which the values of some moments (or expectations of other functions)
are given is treated by Hoeffding (1977) and N. Fisher (1982).

Example 4.2 Nonparametric UMVU estimation of a mean. Let us now further
restrict $\mathcal{F}_0$, the class of all distributions $F$ having a density, by adding the condition
$E|X| < \infty$, and let $g(F) = \int x f(x)\,dx$. Since $\bar X$ is symmetric and unbiased for
$g(F)$, $\bar X$ is UMVU. An alternative proof of this result is obtained by noting that $X_1$
is unbiased for $g(F)$. The UMVU estimator is found by conditioning on the order
statistics: $E[X_1 \mid X_{(1)}, \ldots, X_{(n)}]$. But, given the order statistics, $X_1$ assumes each
value with probability $1/n$. Hence, the above conditional expectation is equal to
$(1/n)\sum X_{(i)} = \bar X$.

In Section 2.2, it was shown that $\bar X$ is UMVU for estimating $E(X_i) = \xi$ in the
family of normal distributions $N(\xi, \sigma^2)$; now it is seen to be UMVU in the family
of all distributions that have a probability density and finite expectation. Which of
these results is stronger? The uniformity makes the nonparametric result appear
much stronger. This is counteracted, however, by the fact that the condition of
unbiasedness is much more restrictive in that case. Thus, the number of competitors
which the UMVU estimator “beats” for such a wide class of distributions is quite
small (see Problem 4.1). It is interesting in this connection to note that, for a
family intermediate between the two considered here, the family of all symmetric
distributions having a probability density, $\bar X$ is not UMVU (Problem 4.4; see also
Bickel and Lehmann 1975-1979). ||

Example 4.3 Nonparametric UMVU estimation of a variance. Let $g(F) =
\mathrm{var}\,X$. Then $[\sum (X_i - \bar X)^2]/(n - 1)$ is symmetric and unbiased, and hence is UMVU. ||


Example 4.4 Nonparametric UMVU estimation of a second moment. Let
$g(F) = \xi^2$, where $\xi = EX$. Now, $\sigma^2 = E(X^2) - \xi^2$ and a symmetric unbiased
estimator of $E(X^2)$ is $\sum X_i^2/n$. Hence, the UMVU estimator of $\xi^2$ is
$\sum X_i^2/n - \sum (X_i - \bar X)^2/(n - 1)$.

An alternative derivation of this result is obtained by noting that $X_1 X_2$ is unbiased
for $\xi^2$. The UMVU estimator of $\xi^2$ can thus be found by conditioning:
$E[X_1 X_2 \mid X_{(1)}, \ldots, X_{(n)}]$. But, given the order statistics, the pair $\{X_1, X_2\}$ assumes
the value of each pair $\{X_{(i)}, X_{(j)}\}$, $i \ne j$, with probability $1/n(n - 1)$. Hence, the
above conditional expected value is

$\dfrac{1}{n(n-1)} \sum_{i \ne j} X_i X_j$,

which is equivalent to the earlier result. ||


Consider, now, quite generally a function $g(F)$ which is U-estimable in $\mathcal{F}_0$.
Then, there exists an integer $m \le n$ and a function $\delta(X_1, \ldots, X_m)$, which is





unbiased for $g(F)$. We can assume without loss of generality that $\delta$ is symmetric
in its $m$ arguments; otherwise, it can be symmetrized. Then, the estimator

(4.1)    $\dfrac{1}{\binom{n}{m}} \sum_{(i_1, \ldots, i_m)} \delta(X_{i_1}, \ldots, X_{i_m})$

is UMVU for $g(F)$; here, the sum is over all $m$-tuples $(i_1, \ldots, i_m)$ from the integers
$1, 2, \ldots, n$ with $i_1 < \cdots < i_m$. That this estimator is UMVU follows from the
facts that it is symmetric and that each of the $\binom{n}{m}$ summands has expectation
$g(F)$.

The class of statistics (4.1), called U-statistics, was studied by Hoeffding (1948)
who, in particular, gave conditions for their asymptotic normality; for further work
on U-statistics, see Serfling 1980, Staudte and Sheather 1990, Lee 1990, or
Koroljuk and Borovskich 1994.
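
The construction in (4.1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `u_statistic` and `sample_variance` and the data values are hypothetical, and the two kernels are the degree-2 kernels $(x_1 - x_2)^2/2$ (unbiased for the variance) and $x_1 x_2$ (unbiased for $\xi^2$).

```python
from itertools import combinations
from math import comb

def u_statistic(kernel, xs, m):
    """Average a symmetric kernel of m arguments over all m-subsets of xs, as in (4.1)."""
    index_sets = combinations(range(len(xs)), m)
    total = sum(kernel(*(xs[i] for i in idx)) for idx in index_sets)
    return total / comb(len(xs), m)

def sample_variance(xs):
    """Usual unbiased variance estimator with divisor n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [2.0, 3.5, 1.0, 4.0, 6.5]  # hypothetical sample

# Kernel (x1 - x2)^2 / 2 is unbiased for the variance.
var_u = u_statistic(lambda a, b: 0.5 * (a - b) ** 2, data, 2)

# Kernel x1 * x2 is unbiased for the squared mean xi^2.
mean_sq_u = u_statistic(lambda a, b: a * b, data, 2)

# Estimator of xi^2 written as (sum of squares)/n minus the sample variance.
mean_sq_direct = sum(x * x for x in data) / len(data) - sample_variance(data)
```

On this sample, `var_u` agrees with the sample variance of Example 4.3 and `mean_sq_u` agrees with the estimator of Example 4.4, up to floating-point rounding.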

Two problems suggest themselves: 


(a) What kind of functions $g(F)$ have unbiased estimators, that is, are U-estimable?


(b) If a functional g(F) has an unbiased estimator, what is the smallest number 
of observations for which the unbiased estimator exists? We shall call this 
smallest number the degree of g(F). 

(For the case that $F$ assigns positive probability only to the two values 0 and 1,
these questions are answered in the preceding section.)


Example 4.5 Degree of the variance. Let $g(F)$ be the variance $\sigma^2$ of $F$. Then
$g(F)$ has an unbiased estimator in the subset $\mathcal{F}_0'$ of $\mathcal{F}_0$ with $E_F X^2 < \infty$ and $n = 2$
observations, since $\sum (X_i - \bar X)^2/(n - 1) = \frac{1}{2}(X_2 - X_1)^2$ is unbiased for $\sigma^2$. Hence,
the degree of $\sigma^2$ is $\le 2$. Furthermore, since in the normal case with unknown mean
there is no unbiased estimator of $\sigma^2$ based on only one observation (Problem 2.7),
there is no such estimator within the class $\mathcal{F}_0'$. It follows that the degree of $\sigma^2$ is 2.


We shall now give another proof that the degree of $\sigma^2$ in this example is greater
than 1 to illustrate a method that is of more general applicability for problems of
this type.

Let $g$ be any estimand that is of degree 1 in $\mathcal{F}_0'$. Then, there exists $\delta$ such that

$\int \delta(x)\,dF(x) = g(F)$, for all $F \in \mathcal{F}_0'$.

Fix two arbitrary distributions $F_1$ and $F_2$ in $\mathcal{F}_0'$ with $F_1 \ne F_2$, and let $F =
\alpha F_1 + (1 - \alpha)F_2$, $0 \le \alpha \le 1$. Then $\alpha F_1 + (1 - \alpha)F_2$ is also in $\mathcal{F}_0'$, and

(4.2)    $g[\alpha F_1 + (1 - \alpha)F_2] = \alpha \int \delta(x)\,dF_1(x) + (1 - \alpha) \int \delta(x)\,dF_2(x)$.

As a function of $\alpha$, the right-hand side
is linear in $\alpha$. Thus, the only $g$'s that can be of degree 1 are those for which the
left-hand side is linear in $\alpha$.

Now, consider

$g(F) = \sigma_F^2 = E(X^2) - [EX]^2$.

In this case,

(4.3)    $\sigma^2_{\alpha F_1 + (1-\alpha)F_2} = \alpha E(X_1^2) + (1 - \alpha)E(X_2^2) - [\alpha EX_1 + (1 - \alpha)EX_2]^2$

where $X_i$ is distributed according to $F_i$. The coefficient of $\alpha^2$ on the right-hand
side is seen to be $-[E(X_2) - E(X_1)]^2$. Since this is not zero for all $F_1, F_2 \in \mathcal{F}_0'$,
the right-hand side is not linear in $\alpha$, and it follows that $\sigma^2$ is not of degree 1. ||
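
The claim about the coefficient of $\alpha^2$ in (4.3) can be checked numerically. The sketch below is an illustration only: the function name `mixture_variance` and the moments chosen for $F_1$ and $F_2$ are hypothetical. It recovers the quadratic coefficient from a second difference of the mixture variance at $\alpha = 0, 1/2, 1$.

```python
def mixture_variance(alpha, m1, s1, m2, s2):
    """Right side of (4.3), given means m_i and second moments s_i of F_1 and F_2."""
    return alpha * s1 + (1 - alpha) * s2 - (alpha * m1 + (1 - alpha) * m2) ** 2

m1, s1 = 0.0, 1.0    # hypothetical F_1: mean 0, E(X^2) = 1
m2, s2 = 3.0, 13.0   # hypothetical F_2: mean 3, E(X^2) = 13

v0, vh, v1 = (mixture_variance(a, m1, s1, m2, s2) for a in (0.0, 0.5, 1.0))

# For a quadratic c0 + c1*a + c2*a^2, the second difference q(0) - 2q(1/2) + q(1) equals c2 / 2.
alpha_sq_coefficient = 2.0 * (v0 - 2.0 * vh + v1)
expected_coefficient = -(m2 - m1) ** 2
```

Because the two means differ, the coefficient is nonzero, so the mixture variance is not linear in $\alpha$.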


Generalizing (4.2), we see that if $g(F)$ is of degree $m$, then

(4.4)    $g[\alpha F_1 + (1 - \alpha)F_2] = \int \cdots \int \delta(x_1, \ldots, x_m)\, d[\alpha F_1(x_1) + (1 - \alpha)F_2(x_1)] \cdots d[\alpha F_1(x_m) + (1 - \alpha)F_2(x_m)]$

is a polynomial of degree at most $m$,
which is thus a necessary condition for $g$ to be estimable with $m$ observations.
Conditions for (4.4) to be also sufficient are given by Bickel and Lehmann (1969).

Condition (4.4) may also be useful for proving that there exists no value of n 
for which a functional g(F) has an unbiased estimate. 


Example 4.6 Nonexistence of unbiased estimator. Let $g(F) = \sigma$. Then
$g[\alpha F_1 + (1 - \alpha)F_2]$ is the square root of the right-hand side of (4.3). Since this quadratic
in $\alpha$ is not a perfect square for all $F_1, F_2 \in \mathcal{F}_0'$, it follows that its square root is not
a polynomial. Hence $\sigma$ does not have an unbiased estimator for any fixed number
$n$ of observations. ||


Let us now turn from the one-sample to the two-sample problem. Let $X_1, \ldots, X_m$
and $Y_1, \ldots, Y_n$ be independently distributed according to distributions $F$ and $G \in
\mathcal{F}_0$. Then the order statistics $X_{(1)} < \cdots < X_{(m)}$ and $Y_{(1)} < \cdots < Y_{(n)}$ are sufficient
and complete (Problem 4.5). A statistic $\delta$ is a function of these order statistics if
and only if $\delta$ is symmetric in the $X_i$'s and separately symmetric in the $Y_j$'s.

Example 4.7 Two-sample UMVU estimator. Let $h(F, G) = E(Y) - E(X)$. Then
$\bar Y - \bar X$ is unbiased for $h(F, G)$. Since it is a function of the complete sufficient
statistic, it is UMVU. ||

The concept of degree runs into difficulty in the present case. Smallest values $m_0$
and $n_0$ are sought for which a given functional $h(F, G)$ has an unbiased estimator.
One possibility is to find the smallest $m$ for which there exists an $n$ such that
$h(F, G)$ has an unbiased estimator, and to let $m_0$ and $n_0$ be the smallest values so
determined. This procedure is not symmetric in $m$ and $n$. However, it can be shown
that if the reverse procedure is used, the same minimum values are obtained. [See
Bickel and Lehmann (1969).]

As a last illustration, let us consider the bivariate nonparametric problem. Let
$(X_1, Y_1), \ldots, (X_n, Y_n)$ be iid according to a distribution $F \in \mathcal{F}$, the family of
all bivariate distributions having a probability density. In analogy with the order
statistics in the univariate case, the set of pairs

$T = \{[X_{(1)}, Y_{j_1}], \ldots, [X_{(n)}, Y_{j_n}]\}$,

that is, the $n$ pairs $(X_i, Y_i)$, ordered according to the value of their first coordinate,
constitute a sufficient statistic. An equivalent statistic is

$T' = \{[X_{i_1}, Y_{(1)}], \ldots, [X_{i_n}, Y_{(n)}]\}$,

that is, the set of pairs $(X_i, Y_i)$ ordered according to the value of the second
coordinate. Here, as elsewhere, the only aspect of $T$ that matters is the set of points
to which $T$ assigns a constant value. In the present case, these are the $n!$ points
that can be obtained from the given point $[(X_1, Y_1), \ldots, (X_n, Y_n)]$ by permuting
the $n$ pairs. As in the univariate case, the conditional probability of each of these
permutations given $T$ or $T'$ is $1/n!$. Also, as in the univariate case, $T$ is complete
(Problem 4.10).

An estimator $\delta$ is a function of the complete sufficient statistic if and only if $\delta$ is
invariant under permutation of the $n$ pairs. Hence, any such function is the unique
UMVU estimator of its expectation.

Example 4.8 U-estimation of covariance. The estimator $\sum (X_i - \bar X)(Y_i - \bar Y)/(n - 1)$
is UMVU for $\mathrm{cov}(X, Y)$ (Problem 4.8). ||


5 The Information Inequality 


The principal applications of UMVU estimators are to exponential families, as
illustrated in Sections 2.2-2.3. When a UMVU estimator does not exist, the variance
$V_L(\theta_0)$ of the LMVU estimator at $\theta_0$ is the smallest variance that an unbiased
estimator can achieve at $\theta_0$. This establishes a useful benchmark against which to
measure the performance of a given unbiased estimator $\delta$. If the variance of $\delta$ is
close to $V_L(\theta)$ for all $\theta$, not much further improvement is possible. Unfortunately,
the function $V_L(\theta)$ is usually difficult to determine. Instead, in this section, we shall
derive some lower bounds which are typically not sharp [i.e., lie below $V_L(\theta)$] but
are much simpler to calculate. One of the resulting inequalities for the variance, the
information inequality, will be used in Chapter 5 as a tool for minimax estimation.
However, its most important role is in Chapter 6, where it provides insight and
motivation for the theory of asymptotically efficient estimators.

For any estimator $\delta$ of $g(\theta)$ and any function $\psi(x, \theta)$ with a finite second
moment, the covariance inequality (Problem 1.5) states that

(5.1)    $\mathrm{var}(\delta) \ge \dfrac{[\mathrm{cov}(\delta, \psi)]^2}{\mathrm{var}(\psi)}$.

In general, this inequality is not helpful since the right side also involves $\delta$.
However, when $\mathrm{cov}(\delta, \psi)$ depends on $\delta$ only through $E_\theta(\delta) = g(\theta)$, (5.1) does provide
a lower bound for the variance of all unbiased estimators of $g(\theta)$. The following
result is due to Blyth (1974).


Theorem 5.1 A necessary and sufficient condition for $\mathrm{cov}(\delta, \psi)$ to depend on $\delta$
only through $g(\theta)$ is that for all $\theta$

(5.2)    $\mathrm{cov}(U, \psi) = 0$ for all $U \in \mathcal{U}$,

where $\mathcal{U}$ is the class of statistics defined in Theorem 1.1, that is,

$\mathcal{U} = \{U : E_\theta U = 0,\ E_\theta U^2 < \infty, \text{ for all } \theta \in \Omega\}$.

Proof. To say that $\mathrm{cov}(\delta, \psi)$ depends on $\delta$ only through $g(\theta)$ is equivalent to
saying that for any two estimators $\delta_1$ and $\delta_2$ with $E_\theta \delta_1 = E_\theta \delta_2$ for all $\theta$, we have
$\mathrm{cov}(\delta_1, \psi) = \mathrm{cov}(\delta_2, \psi)$. The proof of the theorem is then easily established by
writing

(5.3)    $\mathrm{cov}(\delta_1, \psi) - \mathrm{cov}(\delta_2, \psi) = \mathrm{cov}(\delta_1 - \delta_2, \psi) = \mathrm{cov}(U, \psi)$

and noting that therefore, $\mathrm{cov}(\delta_1, \psi) = \mathrm{cov}(\delta_2, \psi)$ for all $\delta_1$ and $\delta_2$ if and only if
$\mathrm{cov}(U, \psi) = 0$ for all $U$. □


Example 5.2 Hammersley-Chapman-Robbins inequality. Suppose $X$ is distributed
with density $p_\theta = p(x, \theta)$, and for the moment, suppose that $p(x, \theta) > 0$
for all $x$. If $\theta$ and $\theta + \Delta$ are two values for which $g(\theta) \ne g(\theta + \Delta)$, then the function

(5.4)    $\psi(x, \theta) = \dfrac{p(x, \theta + \Delta)}{p(x, \theta)} - 1$

satisfies the conditions of Theorem 5.1 since

(5.5)    $E_\theta(\psi) = 0$

and hence

$\mathrm{cov}(U, \psi) = E(\psi U) = E_{\theta+\Delta}(U) - E_\theta(U) = 0$.

In fact,

$\mathrm{cov}(\delta, \psi) = E_\theta(\delta\psi) = g(\theta + \Delta) - g(\theta)$,

so that (5.1) becomes

(5.6)    $\mathrm{var}(\delta) \ge [g(\theta + \Delta) - g(\theta)]^2 \Big/ E_\theta\left[\dfrac{p(X, \theta + \Delta)}{p(X, \theta)} - 1\right]^2$.

Since this inequality holds for all $\Delta$, it also holds when the right side is replaced
by its supremum over $\Delta$. The resulting lower bound is due to Hammersley (1950)
and Chapman and Robbins (1951). ||


In this inequality, the assumption of a common support for the distributions $p_\theta$
can be somewhat relaxed. If $S(\theta)$ denotes the support of $p_\theta$, (5.6) will be valid
provided $S(\theta + \Delta)$ is contained in $S(\theta)$. In taking the supremum over $\Delta$, attention
must then be restricted to the values of $\Delta$ for which this condition holds.
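
As a rough numerical illustration of (5.6), the sketch below evaluates the right side for the Poisson family with $g(\lambda) = \lambda$, using the ratio form of $\psi$ in (5.4). The function names, the truncation point `x_max`, and the grid of $\Delta$ values are assumptions made for the example; the grid keeps $\lambda + \Delta > 0$.

```python
from math import exp, lgamma, log, log1p

def poisson_pmf(x, lam):
    """Poisson probability, computed on the log scale for stability."""
    return exp(x * log(lam) - lam - lgamma(x + 1))

def hcr_bound(lam, delta, x_max=100):
    """Right side of (5.6) for g(lambda) = lambda; the sum is truncated at x_max."""
    second_moment = 0.0
    for x in range(x_max + 1):
        ratio = exp(x * log1p(delta / lam) - delta)  # p(x, lam + delta) / p(x, lam)
        second_moment += (ratio - 1.0) ** 2 * poisson_pmf(x, lam)
    return delta ** 2 / second_moment

lam = 3.0
deltas = [d / 100.0 for d in range(-200, 201) if d != 0]
best = max(hcr_bound(lam, d) for d in deltas)   # approximate supremum over the grid
at_one = hcr_bound(lam, 1.0)
```

For this family the bound equals $\Delta^2/(e^{\Delta^2/\lambda} - 1)$, so on the grid the largest value occurs at the smallest $|\Delta|$ and lies just below $\lambda$, the variance of the unbiased estimator $X$.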

When certain regularity conditions are satisfied, a classic inequality is obtained
by letting $\Delta \to 0$ in (5.4). The inequality (5.6) is unchanged if (5.4) is replaced
by

$\dfrac{p_{\theta+\Delta} - p_\theta}{\Delta}\,\dfrac{1}{p_\theta}$,

which tends to $((\partial/\partial\theta)p_\theta)/p_\theta$ as $\Delta \to 0$, provided $p_\theta$ is differentiable with respect
to $\theta$. This suggests as an alternative to (5.4)

(5.7)    $\psi(x, \theta) = \dfrac{\partial p(x, \theta)/\partial\theta}{p(x, \theta)}$.

Since for any $U \in \mathcal{U}$, clearly $(d/d\theta)E_\theta(U) = 0$, $\psi$ will satisfy (5.2), provided

$E_\theta(U) = \int U p_\theta\,d\mu$

can be differentiated with respect to $\theta$ under the integral sign for all $U \in \mathcal{U}$. To
obtain the resulting lower bound, let $p_\theta' = (\partial p_\theta/\partial\theta)$ so that

$\mathrm{cov}(\delta, \psi) = \int \delta p_\theta'\,d\mu$.

If differentiation under the integral sign is permitted in

$\int \delta p_\theta\,d\mu = g(\theta)$,

it then follows that

(5.8)    $\mathrm{cov}(\delta, \psi) = g'(\theta)$

and hence

(5.9)    $\mathrm{var}(\delta) \ge \dfrac{[g'(\theta)]^2}{\mathrm{var}_\theta\left[\dfrac{\partial}{\partial\theta} \log p(X, \theta)\right]}$.

The assumptions required for this inequality will be stated more formally in
Theorem 5.15, where we will pay particular attention to requirements on the estimator.
Pitman (1979, Chapter 5) provides an interesting interpretation of the inequality
and discussion of the regularity assumptions.

The function $\psi$ defined by (5.7) is the relative rate at which the density $p_\theta$
changes at $x$. The average of the square of this rate is denoted by

(5.10)    $I(\theta) = E_\theta\left[\dfrac{\partial}{\partial\theta} \log p(X, \theta)\right]^2 = \int \left(\dfrac{p_\theta'}{p_\theta}\right)^2 p_\theta\,d\mu$.

It is plausible that the greater this expectation is at a given value $\theta_0$, the easier it
is to distinguish $\theta_0$ from neighboring values $\theta$, and, therefore, the more accurately
$\theta$ can be estimated at $\theta = \theta_0$. (Under suitable assumptions this surmise turns out
to be correct for large samples; see Chapter 6.) The quantity $I(\theta)$ is called the
information (or the Fisher information) that $X$ contains about the parameter $\theta$.
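
Definition (5.10) can be evaluated numerically for a discrete family by summing the squared score against the probability function. The following is a minimal sketch for the Poisson family; the names `log_pmf` and `fisher_information`, the truncation point, and the finite-difference step are assumptions made for illustration.

```python
from math import exp, lgamma, log

def log_pmf(x, lam):
    """Log of the Poisson probability function."""
    return x * log(lam) - lam - lgamma(x + 1)

def fisher_information(lam, x_max=200, step=1e-5):
    """Approximate (5.10): sum of (d/d lambda log p)^2 * p over x = 0, ..., x_max."""
    total = 0.0
    for x in range(x_max + 1):
        # Central difference approximation to the score at lam.
        score = (log_pmf(x, lam + step) - log_pmf(x, lam - step)) / (2 * step)
        total += score ** 2 * exp(log_pmf(x, lam))
    return total

info_at_4 = fisher_information(4.0)
```

The result is close to $1/\lambda = 0.25$, the value of $I(\lambda)$ for the Poisson family.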

It is important to realize that $I(\theta)$ depends on the particular parametrization
chosen. In fact, if $\theta = h(\xi)$ and $h$ is differentiable, the information that $X$ contains
about $\xi$ is

(5.11)    $I^*(\xi) = I[h(\xi)] \cdot [h'(\xi)]^2$.

When different parametrizations are considered in a single problem, the notation
$I(\theta)$ is inadequate; however, it suffices for most applications.

To obtain alternative expressions for $I(\theta)$ that sometimes are more convenient,
let us make the following assumptions:

(5.12)
(a) $\Omega$ is an open interval (finite, infinite, or semi-infinite).
(b) The distributions $P_\theta$ have common support, so that
    without loss of generality the set $A = \{x : p_\theta(x) > 0\}$
    is independent of $\theta$.
(c) For any $x$ in $A$ and $\theta$ in $\Omega$, the derivative
    $p_\theta'(x) = \partial p_\theta(x)/\partial\theta$ exists and is finite.

Lemma 5.3

(a) If (5.12) holds, and the derivative with respect to $\theta$ of the left side of

(5.13)    $\int p_\theta(x)\,d\mu(x) = 1$

can be obtained by differentiating under the integral sign, then

(5.14)    $E_\theta\left[\dfrac{\partial}{\partial\theta} \log p_\theta(X)\right] = 0$

and

(5.15)    $I(\theta) = \mathrm{var}_\theta\left[\dfrac{\partial}{\partial\theta} \log p_\theta(X)\right]$.

(b) If, in addition, the second derivative with respect to $\theta$ of $\log p_\theta(x)$ exists for
all $x$ and $\theta$ and the second derivative with respect to $\theta$ of the left side of (5.13)
can be obtained by differentiating twice under the integral sign, then

(5.16)    $I(\theta) = -E_\theta\left[\dfrac{\partial^2}{\partial\theta^2} \log p_\theta(X)\right]$.

Proof.

(a) Equation (5.14) is derived by differentiating (5.13), and (5.15) follows from
(5.10) and (5.14).

(b) We have

$\dfrac{\partial^2}{\partial\theta^2} \log p_\theta(x) = \dfrac{\dfrac{\partial^2}{\partial\theta^2} p_\theta(x)}{p_\theta(x)} - \left[\dfrac{\dfrac{\partial}{\partial\theta} p_\theta(x)}{p_\theta(x)}\right]^2$,

and the result follows by taking the expectation of both sides. □


Let us now calculate $I(\theta)$ for some of the families discussed in Sections 1.4 and
1.5.

We first look at exponential families with $s = 1$, given in Equation (1.5.1), and
derive a relationship between some unbiased estimators and information.

Theorem 5.4 Let $X$ be distributed according to the exponential family (1.5.1) with
$s = 1$, and let

(5.17)    $\tau(\theta) = E_\theta(T)$,

the so-called mean-value parameter. Then,

(5.18)    $I[\tau(\theta)] = \dfrac{1}{\mathrm{var}_\theta(T)}$.

Proof. From Equation (5.15), the amount of information that $X$ contains about
$\theta$, $I(\theta)$, is

(5.19)    $I(\theta) = \mathrm{var}_\theta\left[\dfrac{\partial}{\partial\theta} \log p_\theta(X)\right]$
          $= \mathrm{var}_\theta[\eta'(\theta)T(X) - B'(\theta)]$    (from (1.5.1))
          $= [\eta'(\theta)]^2\,\mathrm{var}(T)$.

Now, from (5.11), the information $I[\tau(\theta)]$ that $X$ contains about $\tau(\theta)$ is given by

(5.20)    $I[\tau(\theta)] = \dfrac{I(\theta)}{[\tau'(\theta)]^2} = \left[\dfrac{\eta'(\theta)}{\tau'(\theta)}\right]^2 \mathrm{var}(T)$.

Finally, using the fact that $\tau(\theta) = B'(\theta)/\eta'(\theta)$ [(Problem 1.5.6)], we have

(5.21)    $\mathrm{var}(T) = \dfrac{B''(\theta) - \eta''(\theta)\tau(\theta)}{[\eta'(\theta)]^2} = \dfrac{\tau'(\theta)}{\eta'(\theta)}$,

and substituting (5.21) into (5.20) yields (5.18). □

If we combine Equations (5.11) and (5.19), then for any differentiable function
$h(\theta)$, we have

(5.22)    $I[h(\theta)] = \left[\dfrac{\eta'(\theta)}{h'(\theta)}\right]^2 \mathrm{var}(T)$.


Example 5.5 Information in a gamma variable. Let $X \sim \mathrm{Gamma}(\alpha, \beta)$, where
we assume that $\alpha$ is known. The density is given by

(5.23)    $p_\beta(x) = \dfrac{1}{\Gamma(\alpha)\beta^\alpha}\, x^{\alpha-1} e^{-x/\beta} = e^{(-1/\beta)x - \alpha \log(\beta)}\, h(x)$

with $h(x) = x^{\alpha-1}/\Gamma(\alpha)$. In this parametrization, $\eta(\beta) = -1/\beta$, $T(x) = x$ and
$B(\beta) = \alpha \log(\beta)$. Thus, $E(T) = \alpha\beta$, $\mathrm{var}(T) = \alpha\beta^2$, and the information in $X$ about
$\alpha\beta$ is $I(\alpha\beta) = 1/\alpha\beta^2$.

If we are instead interested in the information in $X$ about $\beta$, then we can
reparameterize (5.23) using $\eta(\beta) = -\alpha/\beta$ and $T(x) = x/\alpha$. From (5.22), we have,
quite generally, that $I[ch(\theta)] = \frac{1}{c^2} I[h(\theta)]$, so the information in $X$ about $\beta$ is
$I(\beta) = \alpha/\beta^2$. ||


Table 5.1 gives $I[\tau(\theta)]$ for a number of special cases.

Qualitatively, $I[\tau(\theta)]$ given by (5.18) behaves as one would expect. Since $T$ is
the UMVU estimator of its expectation $\tau(\theta)$, the variance of $T$ is a measure of the
difficulty of estimating $\tau(\theta)$. Thus, the reciprocal of the variance measures the ease
with which $\tau(\theta)$ can be estimated and, in this sense, the information $X$ contains
about $\tau(\theta)$.

Example 5.6 Information in a normal variable. Consider the case of the $N(\xi, \sigma^2)$
distribution with $\sigma$ known, when the interest is in estimation of $\xi^2$. The density is
given by

$p_\xi(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x - \xi)^2} = e^{\eta(\xi)T(x) - B(\xi)}\, h(x)$

with $\eta(\xi) = \xi$, $T(x) = x/\sigma^2$, $B(\xi) = \frac{1}{2}\xi^2/\sigma^2$, and $h(x) = e^{-x^2/2\sigma^2}/\sqrt{2\pi}\,\sigma$. The
information in $X$ about $h(\xi) = \xi^2$ is given by

$I[\xi^2] = \left[\dfrac{\eta'(\xi)}{h'(\xi)}\right]^2 \mathrm{var}(T) = \dfrac{1}{4\xi^2\sigma^2}$.

Note that we could have equivalently defined $\eta(\xi) = \xi/\sigma^2$, $T(x) = x$ and arrived
at the same answer. ||

Table 5.1. $I[\tau(\theta)]$ for Some Exponential Families

Distribution            Parameter $\tau(\theta)$    $I(\tau(\theta))$
$N(\xi, \sigma^2)$      $\xi$                       $1/\sigma^2$
$N(\xi, \sigma^2)$      $\sigma^2$                  $1/2\sigma^4$
$b(p, n)$               $p$                         $n/pq$
$P(\lambda)$            $\lambda$                   $1/\lambda$
$\Gamma(\alpha, \beta)$ $\beta$                     $\alpha/\beta^2$


Example 5.7 Information about a function of a Poisson parameter. Suppose
that $X$ has the Poisson($\lambda$) distribution, so that $I[\lambda]$, the information $X$ contains
about $\lambda = E(X)$, is $1/\lambda$. For $\eta(\lambda) = \log \lambda$, which is an increasing function of
$\lambda$, $I[\log \lambda] = \lambda$. Thus, the information in $X$ about $\lambda$ is inversely proportional to
that about $\log \lambda$. In particular, for large values of $\lambda$, it seems that the parameter
$\log \lambda$ can be estimated quite accurately, although the converse is true for $\lambda$. This
conclusion is correct and is explained by the fact that $\log \lambda$ changes very slowly
when $\lambda$ is large. Hence, for large $\lambda$, even a large error in the estimate of $\lambda$ will
lead to only a small error in $\log \lambda$, whereas the situation is reversed for $\lambda$ near
zero where $\log \lambda$ changes very rapidly. It is interesting to note that there exists a
function of $\lambda$ [namely $h(\lambda) = \sqrt{\lambda}$] whose behavior is intermediate between that of
$h(\lambda) = \lambda$ and $h(\lambda) = \log \lambda$, in that the amount of information $X$ contains about it
is constant, independent of $\lambda$ (Problem 5.6). ||
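
The three Poisson cases in Example 5.7 follow from (5.11), which gives $I[h(\lambda)] = I(\lambda)/[h'(\lambda)]^2$. The sketch below tabulates them for a few values of $\lambda$; the helper name `info_about`, the use of a numerical derivative, and the chosen values of $\lambda$ are assumptions made for illustration.

```python
from math import log, sqrt

def info_about(h, lam, info_lam, step=1e-4):
    """Information about h(lambda) via (5.11): I(lambda) / h'(lambda)^2."""
    h_prime = (h(lam + step) - h(lam - step)) / (2 * step)  # central difference
    return info_lam / h_prime ** 2

rows = []
for lam in (0.5, 2.0, 50.0):
    i_lam = 1.0 / lam  # Poisson information about lambda
    rows.append((lam,
                 i_lam,
                 info_about(log, lam, i_lam),
                 info_about(sqrt, lam, i_lam)))

for lam, i_lam, i_log, i_sqrt in rows:
    print(f"lambda={lam:5.1f}  I[lambda]={i_lam:.4f}  I[log]={i_log:.4f}  I[sqrt]={i_sqrt:.4f}")
```

The second column decreases in $\lambda$, the third increases, and the last stays constant at 4 up to numerical error, in line with the closing remark of the example.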


As a second class of distributions for which to evaluate $I(\theta)$, consider location
families with density

(5.24)    $f(x - \theta)$    ($x$, $\theta$ real-valued)

where $f(x) > 0$ for all $x$. Conditions (5.12) are satisfied provided the derivative
$f'(x)$ of $f(x)$ exists for all values of $x$. It is seen that $I(\theta)$ is independent of $\theta$ and
given by (Problem 5.14)

(5.25)    $I_f = \int_{-\infty}^{\infty} \dfrac{[f'(x)]^2}{f(x)}\,dx$.

Table 5.2 shows $I_f$ for a number of distributions (defined in Table 1.4.1).

Table 5.2. $I_f$ for Some Standard Distributions

Distribution    N(0, 1)    L(0, 1)    C(0, 1)    DE(0, 1)
$I_f$           1          1/3        1/2        1

Actually, the double exponential density does not satisfy the stated assumptions
since $f'(x)$ does not exist at $x = 0$. However, (5.25) is valid under the slightly
weaker assumption that $f$ is absolutely continuous [see (1.3.7)] which does hold
in the double exponential case. For this and the extensions below, see, for example,
Huber 1981, Section 4.4. On the other hand, it does not hold when $f$ is the uniform
density on (0, 1) since $f$ is then not continuous and hence, a fortiori, not absolutely
continuous. It turns out that whenever $f$ is not absolutely continuous, it is natural to
put $I_f$ equal to $\infty$. For the uniform distribution, for example, it is easier by an order
of magnitude to estimate $\theta$ (see Problem 5.33) than for any of the distributions
listed in Table 5.2, and it is thus reasonable to assign to $I_f$ the value $\infty$. This
should be contrasted with the fact that $f'(x) = 0$ for all $x \ne 0, 1$, so that formal
application of (5.25) leads to the incorrect value 0.
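
The entries of Table 5.2 can be reproduced approximately by numerical integration of (5.25), written as the integral of $[f'(x)/f(x)]^2 f(x)$. The sketch below uses a plain midpoint rule; the function name `location_information`, the integration limits, and the number of cells are assumptions chosen so that the heavy Cauchy tails contribute negligibly.

```python
from math import cosh, exp, pi, sqrt, tanh

def location_information(density, score, half_width=200.0, cells=200_000):
    """Midpoint-rule approximation of (5.25), with score(x) = f'(x) / f(x)."""
    step = 2.0 * half_width / cells
    total = 0.0
    for k in range(cells):
        x = -half_width + (k + 0.5) * step
        total += score(x) ** 2 * density(x)
    return total * step

# Each entry: (density f, score f'/f) for the standard member of the family.
families = {
    "N(0,1)": (lambda x: exp(-0.5 * x * x) / sqrt(2.0 * pi), lambda x: -x),
    "L(0,1)": (lambda x: 0.25 / cosh(0.5 * x) ** 2, lambda x: -tanh(0.5 * x)),
    "C(0,1)": (lambda x: 1.0 / (pi * (1.0 + x * x)), lambda x: -2.0 * x / (1.0 + x * x)),
    "DE(0,1)": (lambda x: 0.5 * exp(-abs(x)), lambda x: -1.0 if x > 0 else 1.0),
}

info = {name: location_information(f, s) for name, (f, s) in families.items()}
```

The grid is symmetric about 0, so no evaluation point falls on the kink of the double exponential density, where the score is undefined.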

When (5.24) is replaced by

(5.26)    $\dfrac{1}{b}\, f\!\left(\dfrac{x - \theta}{b}\right)$

the amount of information about $\theta$ becomes (Problem 5.14)

(5.27)    $\dfrac{I_f}{b^2}$

with $I_f$ given by (5.25).

The information about $\theta$ contained in independent observations is, as one would
expect, additive. This is stated formally in the following result.

Theorem 5.8 Let $X$ and $Y$ be independently distributed with densities $p_\theta$ and $q_\theta$,
respectively, with respect to measures $\mu$ and $\nu$ satisfying (5.12) and (5.14).

If $I_1(\theta)$, $I_2(\theta)$, and $I(\theta)$ are the information about $\theta$ contained in $X$, $Y$, and
$(X, Y)$, respectively, then

(5.28)    $I(\theta) = I_1(\theta) + I_2(\theta)$.

Proof. By definition,

$I(\theta) = E\left[\dfrac{\partial}{\partial\theta} \log p_\theta(X) + \dfrac{\partial}{\partial\theta} \log q_\theta(Y)\right]^2$,

and the result follows from the fact that the cross-product is zero by (5.14). □


Corollary 5.9 If $X_1, \ldots, X_n$ are iid, satisfy (5.12) and (5.14), and each has
information $I(\theta)$, then the information in $X = (X_1, \ldots, X_n)$ is $nI(\theta)$.

Let us now return to the inequality (5.9), and proceed to a formal statement of
when it holds. If (5.12), and hence (5.15), holds, then the denominator of the right
side of (5.9) can be replaced by $I(\theta)$. The result is the following version of the
Information Inequality.


Theorem 5.10 (The Information Inequality) Suppose $p_\theta$ is a family of densities
with dominating measure $\mu$ for which (5.12) and (5.14) hold, and that $I(\theta) > 0$.
Let $\delta$ be any statistic with

(5.29)    $E_\theta(\delta^2) < \infty$

for which the derivative with respect to $\theta$ of $E_\theta(\delta)$ exists and can be differentiated
under the integral sign, that is,

(5.30)    $\dfrac{d}{d\theta} E_\theta(\delta) = \int \delta\, \dfrac{\partial}{\partial\theta} p_\theta\,d\mu$.

Then

(5.31)    $\mathrm{var}_\theta(\delta) \ge \dfrac{\left[\dfrac{\partial}{\partial\theta} E_\theta(\delta)\right]^2}{I(\theta)}$.

Proof. The result follows from (5.9) and Lemma 5.3 and is seen directly by
differentiating (5.30) and then applying (5.1). □


If $\delta$ is an estimator of $g(\theta)$, with

$E_\theta(\delta) = g(\theta) + b(\theta)$

where $b(\theta)$ is the bias of $\delta$, then (5.31) becomes

(5.32)    $\mathrm{var}_\theta(\delta) \ge \dfrac{[b'(\theta) + g'(\theta)]^2}{I(\theta)}$,

which provides a lower bound for the variance of any estimator in terms of its bias
and $I(\theta)$.

If $\delta = \delta(X)$ where $X = (X_1, \ldots, X_n)$ and if the $X_i$ are iid, then by Corollary 5.9

(5.33)    $\mathrm{var}_\theta(\delta) \ge \dfrac{[b'(\theta) + g'(\theta)]^2}{nI_1(\theta)}$

where $I_1(\theta)$ is the information about $\theta$ contained in $X_1$. Inequalities (5.32) and
(5.33) will be useful in Chapter 5.

Unlike $I(\theta)$, which changes under reparametrization, the lower bound (5.31),
and hence the bounds (5.32) and (5.33), does not. Let $\theta = h(\xi)$ with $h$ differentiable.
Then,

$\dfrac{\partial}{\partial\xi} E_{h(\xi)}(\delta) = \dfrac{\partial}{\partial\theta} E_\theta(\delta) \cdot h'(\xi)$,

and the result follows from (5.11). (See Problem 5.20.)

The lower bound (5.31) for $\mathrm{var}_\theta(\delta)$ typically is not sharp. In fact, under suitable
regularity conditions, it is attained if and only if $p_\theta(x)$ is an exponential family
(1.5.1) with $s = 1$ and $T(x) = \delta(x)$ (see Problem 5.17). However, (5.1) is based
on the Cauchy-Schwarz inequality, which has a well-known condition for equality
(see Problems 5.2 and 5.19). The bound (5.31) will be attained by an estimator if
and only if

(5.34)    $\delta = a\, \dfrac{\partial}{\partial\theta} \log p_\theta(x) + b$

for some constants $a$ and $b$ (which may depend on $\theta$).

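Example 5.11 Binomial attainment of information bound. For the binomial
distribution $X \sim b(p, n)$, we have

$\delta = a\, \dfrac{\partial}{\partial p} \log p_p(x) + b = a\, \dfrac{\partial}{\partial p}\left[x \log p + (n - x)\log(1 - p)\right] + b = a\, \dfrac{x - np}{p(1 - p)} + b$

with $E\delta = b$ and $\mathrm{var}\,\delta = na^2/p(1 - p)$. This form for $\delta$ is the only form of function
for which the information inequality bound (5.31) can be attained. The function
$\delta$ is an estimator only if $a = p(1 - p)$ and $b = np$. This yields $\delta = X$, $E\delta = np$,
and $\mathrm{var}(\delta) = np(1 - p)$. Thus, $X$ is the only unbiased estimator that achieves the
information inequality bound (5.31). ||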
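
The binomial calculation can be confirmed by exact summation over the support. The sketch below is illustrative only: the values of $n$ and $p$ and the helper names are hypothetical. It computes $I(p)$ from the score, the bound (5.31) for $g(p) = np$, and the variance of $X$.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability that a b(p, n) variable equals x."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def score(x, n, p):
    """Derivative of the log probability with respect to p."""
    return (x - n * p) / (p * (1 - p))

n, p = 12, 0.3   # hypothetical values
support = range(n + 1)

information = sum(score(x, n, p) ** 2 * binomial_pmf(x, n, p) for x in support)
variance_x = sum((x - n * p) ** 2 * binomial_pmf(x, n, p) for x in support)

# Bound (5.31) for delta = X, whose expectation n*p has derivative n.
bound = n ** 2 / information

# With a = p(1 - p) and b = n p, the function a * score + b reduces to x itself.
max_gap = max(abs(p * (1 - p) * score(x, n, p) + n * p - x) for x in support)
```

Here `information` matches $n/pq$ from Table 5.1 and `variance_x` equals `bound`, so $X$ attains the bound.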


Many authors have presented general necessary and sufficient conditions for
attainment of the bound (5.31) (Wijsman 1973, Joshi 1976, Müller-Funk et al.
1989). The following theorem is adapted from Müller-Funk et al.

Theorem 5.12 (Attainment) Suppose (5.12) holds, and $\delta$ is a statistic with $\mathrm{var}_\theta\,\delta <
\infty$ for all $\theta \in \Omega$. Then $\delta$ attains the lower bound

$\mathrm{var}_\theta\,\delta = \left[\dfrac{\partial}{\partial\theta} E_\theta\,\delta\right]^2 \Big/ I(\theta)$

for all $\theta \in \Omega$ if and only if there exists a continuously differentiable function $\varphi(\theta)$
such that

$p_\theta(x) = C(\theta)\, e^{\varphi(\theta)\delta(x)}\, h(x)$

is a density with respect to a dominating measure $\mu(x)$ for suitably chosen $C(\theta)$
and $h(x)$, i.e., $p_\theta(x)$ constitutes an exponential family.

Moreover, if $E_\theta\,\delta = g(\theta)$, then $\delta$ and $g$ satisfy

(5.35)    $\delta(x) = \dfrac{g'(\theta)}{I(\theta)}\, \dfrac{\partial}{\partial\theta} \log p_\theta(x) + g(\theta)$,

$g(\theta) = -C'(\theta)/[C(\theta)\varphi'(\theta)]$,

and $I(\theta) = \varphi'(\theta)g'(\theta)$.

Note that the function $\delta$ specified in (5.35) may depend on $\theta$. In such a case, $\delta$ is
not an estimator, and there is no estimator that attains the information bound.

Example 5.13 Poisson attainment of information bound. Suppose $X$ is a discrete
random variable with probability function that is absolutely continuous with
respect to $\mu$ = counting measure, and satisfies

$E_\lambda X = \mathrm{var}_\lambda(X) = \lambda$.

If $X$ attains the Information Inequality bound then $\lambda = [\partial/(\partial\lambda) E_\lambda(X)]^2 / I(\lambda)$, so
from Theorem 5.12 $\varphi'(\lambda) = 1/\lambda$ and the distribution of $X$ must be

$p_\lambda(x) = C(\lambda)\, e^{[\log \lambda]x}\, h(x)$.

Since $g(\lambda) = \lambda = -\lambda C'(\lambda)/C(\lambda)$, it follows that $C(\lambda) = e^{-\lambda}$, which implies
$h(x) = 1/x!$, and $p_\lambda(x)$ is the Poisson distribution. ||

Some improvements over (5.31) are available when the inequality is not attained. 
These will be briefly mentioned at the end of the next section. 


Theorem 5.10 restricts the information inequality to estimators $\delta$ satisfying
(5.29) and (5.30). The first of these conditions imposes no serious restriction
since any estimator that violates it has infinite variance and so satisfies (5.31)
automatically. However, it is desirable to replace
(5.30) by a condition (on the densities $p_\theta$) not involving $\delta$, so that (5.31) will then
hold for all $\delta$. Such conditions will be given in Theorem 5.15 below, with a more
detailed discussion of alternatives given in Note 8.6.

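In reviewing the argument leading to (5.9), the conditions that were needed on
the estimator $\delta(x)$ were

(5.36)
(a) $E_\theta[\delta^2(X)] < \infty$ for all $\theta$
(b) $\dfrac{\partial}{\partial\theta} E_\theta[\delta(X)] = \int \delta(x)\, \dfrac{\partial}{\partial\theta} p_\theta(x)\,d\mu(x) = g'(\theta)$.

The key point is to find a way to ensure that $\mathrm{cov}(\delta, \psi) = (\partial/\partial\theta)E_\theta\,\delta$, and hence
(5.30) holds. Consider the following argument, in which one of the steps is not
immediately justified. For $q_\theta(x) = \partial \log p_\theta(x)/\partial\theta$, write

(5.37)    $\mathrm{cov}(\delta, q_\theta) = \int \delta(x) \left[\dfrac{\partial}{\partial\theta} \log p_\theta(x)\right] p_\theta(x)\,dx$

          $= \int \delta(x) \left[\lim_{\Delta \to 0} \dfrac{p_{\theta+\Delta}(x) - p_\theta(x)}{\Delta\, p_\theta(x)}\right] p_\theta(x)\,dx$

          $\overset{?}{=} \lim_{\Delta \to 0} \int \delta(x) \left[\dfrac{p_{\theta+\Delta}(x) - p_\theta(x)}{\Delta\, p_\theta(x)}\right] p_\theta(x)\,dx$

          $= \lim_{\Delta \to 0} \dfrac{E_{\theta+\Delta}\,\delta(X) - E_\theta\,\delta(X)}{\Delta}$

          $= \dfrac{\partial}{\partial\theta} E_\theta\,\delta(X)$.

Thus (5.30) will hold provided the interchange of limit and integral is valid. A
simple condition for this is given in the following lemma.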

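Lemma 5.14 Assume that (5.12(a)) and (5.12(b)) hold, and let $\delta$ be any estimator
for which $E_\theta\,\delta^2 < \infty$. Let $q_\theta(x) = \partial \log p_\theta(x)/\partial\theta$ and, for some $\varepsilon > 0$, let $b_\theta$ be a
function that satisfies

(5.38)    $E_\theta\, b_\theta^2(X) < \infty$ and $\left|\dfrac{p_{\theta+\Delta}(x) - p_\theta(x)}{\Delta\, p_\theta(x)}\right| \le b_\theta(x)$ for all $|\Delta| < \varepsilon$.

Then $E_\theta\, q_\theta(X) = 0$ and

(5.39)    $\dfrac{\partial}{\partial\theta} E_\theta\,\delta(X) = E\,\delta(X)q_\theta(X) = \mathrm{cov}_\theta(\delta, q_\theta)$,

and thus (5.30) holds.

Proof. Since

$\left|\delta(x)\, \dfrac{p_{\theta+\Delta}(x) - p_\theta(x)}{\Delta\, p_\theta(x)}\right| \le |\delta(x)|\, b_\theta(x)$,

and

$E_\theta[|\delta(x)|\, b_\theta(x)] \le \{E_\theta[\delta(x)^2]\}^{1/2} \{E_\theta[b_\theta(x)^2]\}^{1/2} < \infty$,

it follows from the Dominated Convergence Theorem (Theorem 1.2.5) that the
interchange of limit and integral in (5.37) is valid. □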


An immediate consequence of Lemma 5.14 is the following theorem. 


Theorem 5.15 Suppose $p_\theta(x)$ is a family of densities with dominating measure
$\mu(x)$ satisfying (5.12), $I(\theta) > 0$, and there exists a function $b_\theta$ and $\varepsilon > 0$ for which
(5.38) holds. If $\delta$ is any statistic for which $E_\theta(\delta^2) < \infty$, then the information
inequality (5.31) will hold.


We note that condition (5.38) is similar to what is known as a Lipschitz condition, 
which imposes a smoothness constraint on a function by bounding the left side 
of (5.38) by a constant. It is satisfied for many families of densities (see Problem 
5.27), including of course the exponential family. We give one illustration here. 

Example 5.16 Integrability. Suppose that $X \sim f(x - \theta)$, where $f(x - \theta)$ is
Student's $t$ distribution with $m$ degrees of freedom. It is not immediately obvious
that this family of densities satisfies (5.14), so we cannot directly apply Theorem
5.10. We leave the general case to Problem 5.27(b), and show here that the Cauchy
family ($m = 1$), with density $p_\theta(x) = \dfrac{1}{\pi[1 + (x - \theta)^2]}$, satisfies (5.38). The left side of
(5.38) is

$\left|\dfrac{1}{\Delta}\left(\dfrac{1 + (x - \theta)^2}{1 + (x - \Delta - \theta)^2} - 1\right)\right|$

$= \left|\dfrac{1}{\Delta}\, \dfrac{1 + (x - \theta)^2 - 1 - (x - \Delta - \theta)^2}{1 + (x - \Delta - \theta)^2}\right|$

$= \left|\dfrac{1}{\Delta}\, \dfrac{2\Delta(x - \theta) - \Delta^2}{1 + (x - \Delta - \theta)^2}\right|$

$\le \dfrac{2|x - \Delta - \theta|}{1 + (x - \Delta - \theta)^2} + \dfrac{|\Delta|}{1 + (x - \Delta - \theta)^2}$

$\le 2 + \varepsilon$.

Here the last inequality follows from the facts that $|\Delta| < \varepsilon$ and $|x|/(1 + x^2) \le 1$
for all $x$. Condition (5.38) therefore holds with $b_\theta(x) = 2 + \varepsilon$, which verifies the
information inequality (5.31) for the Cauchy case. ||
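
The bound in Example 5.16 can be spot-checked on a grid. The sketch below is not a proof: the grid of $x$ values, the choice $\varepsilon = 0.5$, and the function names are assumptions. It evaluates the difference quotient on the left side of (5.38) for the Cauchy family and records its largest absolute value.

```python
from math import pi

def cauchy_density(x, theta):
    """Cauchy location density with location theta."""
    return 1.0 / (pi * (1.0 + (x - theta) ** 2))

def difference_quotient(x, theta, delta):
    """Quantity inside the absolute value on the left side of (5.38)."""
    p0 = cauchy_density(x, theta)
    return (cauchy_density(x, theta + delta) - p0) / (delta * p0)

theta, eps = 0.0, 0.5
xs = [k / 10.0 for k in range(-500, 501)]                    # grid on [-50, 50]
deltas = [eps * k / 50.0 for k in range(-49, 50) if k != 0]  # nonzero, |delta| < eps

worst = max(abs(difference_quotient(x, theta, d)) for x in xs for d in deltas)
```

On this grid `worst` stays below $2 + \varepsilon$, as the example asserts. The cruder inequality $|x|/(1 + x^2) \le 1$ used in the text leaves some slack, so the observed maximum is noticeably smaller than the stated bound.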


As a consequence of Theorem 5.15, note

Corollary 5.17 If (5.38) holds, then (5.14) is valid.

Proof. Putting $\delta(x) = 1$ in (5.39), we have that

$0 = \dfrac{d}{d\theta}(1) = E_\theta\left[\dfrac{\partial}{\partial\theta} \log p_\theta(X)\right]$. □


6 The Multiparameter Case and Other Extensions


In discussing the information inequality, we have so far assumed that $\theta$ is
real-valued. To extend the inequalities of the preceding section to the multiparameter
case, we begin by generalizing the inequality (5.1) to one involving several
functions $\psi_i$ ($i = 1, \ldots, r$). This extension also provides a tool for sharpening the
inequality (5.31).

Theorem 6.1 For any unbiased estimator $\delta$ of $g(\theta)$ and any functions $\psi_i(x, \theta)$
with finite second moments, we have

(6.1)    $\mathrm{var}(\delta) \ge \gamma' C^{-1} \gamma$,

where $\gamma' = (\gamma_1 \cdots \gamma_r)$ and $C = \|C_{ij}\|$ are defined by

(6.2)    $\gamma_i = \mathrm{cov}(\delta, \psi_i)$, $C_{ij} = \mathrm{cov}(\psi_i, \psi_j)$.

The right side of (6.1) will depend on $\delta$ only through $g(\theta) = E_\theta(\delta)$, provided each
of the functions $\psi_i$ satisfies (5.2).

Proof. For any constants $a_1, \ldots, a_r$, it follows from (5.1) that

(6.3)    $\mathrm{var}(\delta) \ge \dfrac{[\mathrm{cov}(\delta, \sum a_i\psi_i)]^2}{\mathrm{var}(\sum a_i\psi_i)}$

and direct calculation shows

(6.4)    $\mathrm{cov}(\delta, \sum a_i\psi_i) = a'\gamma$, $\mathrm{var}(\sum a_i\psi_i) = a'Ca$.

Since (6.3) is true for any vector $a$, from (6.4) and (5.1) we have

$\mathrm{var}(\delta) \ge \max_a \dfrac{[a'\gamma]^2}{a'Ca} = \gamma' C^{-1} \gamma$,

where we use the fact (see Problem 6.2) that if $P$ is an $r \times r$ matrix and $p$ an $r \times 1$
column vector such that $P = pp'$, then

(6.5)    $\max_a \dfrac{a'Pa}{a'Qa}$ = largest eigenvalue of $Q^{-1}P$ = $p'Q^{-1}p$.

□
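
The maximization step in the proof can be illustrated for $r = 2$. In the sketch below the vector `gamma` and the matrix `C` are hypothetical values, not derived from any particular model. The code compares $\gamma' C^{-1} \gamma$ with the ratio $(a'\gamma)^2 / (a'Ca)$ at $a = C^{-1}\gamma$ and at randomly drawn vectors $a$.

```python
import random

def quad_form(a, m, b):
    """Compute a' m b for 2-vectors a, b and a 2x2 matrix m."""
    return sum(a[i] * m[i][j] * b[j] for i in range(2) for j in range(2))

def inverse_2x2(m):
    """Inverse of a nonsingular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

gamma = [1.0, 0.5]               # hypothetical cov(delta, psi_i)
C = [[2.0, 0.6], [0.6, 1.0]]     # hypothetical covariance matrix of (psi_1, psi_2)

C_inv = inverse_2x2(C)
bound = quad_form(gamma, C_inv, gamma)   # right side of (6.1)

def ratio(a):
    """Right side of (6.3), rewritten through (6.4)."""
    return (a[0] * gamma[0] + a[1] * gamma[1]) ** 2 / quad_form(a, C, a)

# The maximizing direction is proportional to C^{-1} gamma.
a_star = [C_inv[0][0] * gamma[0] + C_inv[0][1] * gamma[1],
          C_inv[1][0] * gamma[0] + C_inv[1][1] * gamma[1]]

rng = random.Random(0)
trials = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(10_000)]
largest_random_ratio = max(ratio(a) for a in trials)
```

No random direction exceeds `bound`, and `ratio(a_star)` equals it up to rounding, which is the content of (6.5) with $P = \gamma\gamma'$ and $Q = C$.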


As the first and principal application of (6.1), we shall extend the information
inequality (5.31) to the multiparameter case. Let $X$ be distributed with density
$p_\theta$, $\theta \in \Omega$, with respect to $\mu$ where $\theta$ is vector-valued, say $\theta = (\theta_1, \ldots, \theta_s)$.
Suppose that

(6.6)   (5.12)(a) and (b) hold, and in addition
        (c) For any $x$ in $A$, $\theta$ in $\Omega$, and $i = 1, \ldots, s$,
        the derivative $\partial p_\theta(x)/\partial\theta_i$ exists and is finite.


In generalization of (5.10), define the information matrix as the $s \times s$ matrix

(6.7)   $I(\theta) = \|I_{ij}(\theta)\|$
where

(6.8)   $I_{ij}(\theta) = E_\theta\left[\dfrac{\partial}{\partial\theta_i} \log p_\theta(X) \cdot \dfrac{\partial}{\partial\theta_j} \log p_\theta(X)\right].$

If (6.6) holds and the derivative with respect to each $\theta_i$ of the left side of (5.13)
can be obtained by differentiating under the integral sign, one obtains, as in Lemma
5.3,

(6.9)   $E_\theta\left[\dfrac{\partial}{\partial\theta_i} \log p_\theta(X)\right] = 0$

and

(6.10)   $I_{ij}(\theta) = \operatorname{cov}\left[\dfrac{\partial}{\partial\theta_i} \log p_\theta(X), \dfrac{\partial}{\partial\theta_j} \log p_\theta(X)\right].$

Being a covariance matrix, $I(\theta)$ is positive semidefinite and positive definite unless
the $(\partial/\partial\theta_i) \log p_\theta(X)$, $i = 1, \ldots, s$, are affinely dependent (and hence, by (6.9),
linearly dependent).

If, in addition to satisfying (6.6) and (6.9), the density $p_\theta$ also has second derivatives
$\partial^2 p_\theta(x)/\partial\theta_i\partial\theta_j$ for all $i$ and $j$, there is, in generalization of (5.16), an alternative
expression for $I_{ij}(\theta)$ which is often more convenient (Problem 6.4),

(6.11)   $I_{ij}(\theta) = -E_\theta\left[\dfrac{\partial^2}{\partial\theta_i\partial\theta_j} \log p_\theta(X)\right].$

In the multiparameter situation with $\theta = (\theta_1, \ldots, \theta_s)$, Theorem 5.8 and Corollary
5.9 continue to hold with only the obvious changes, that is, information matrices
for independent observations are additive.

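To see how an information matrix changes under reparametrization, suppose
that

(6.12)   $\theta_i = h_i(\xi_1, \ldots, \xi_s), \quad i = 1, \ldots, s,$

and let $J$ be the matrix

(6.13)   $J = \left\|\dfrac{\partial\theta_j}{\partial\xi_i}\right\|.$

Let the information matrix for $(\xi_1, \ldots, \xi_s)$ be $I^*(\xi) = \|I^*_{ij}(\xi)\|$ where

(6.14)   $I^*_{ij}(\xi) = E\left[\dfrac{\partial}{\partial\xi_i} \log p_{\theta(\xi)}(X) \cdot \dfrac{\partial}{\partial\xi_j} \log p_{\theta(\xi)}(X)\right].$

Then, it is seen from the chain rule for differentiating a function of several variables
that (Problem 6.7)

(6.15)   $I^*_{ij}(\xi) = \sum_k \sum_l I_{kl}(\theta) \dfrac{\partial\theta_k}{\partial\xi_i} \dfrac{\partial\theta_l}{\partial\xi_j}$

and hence that

(6.16)   $I^*(\xi) = J I J'.$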
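
As an illustration of (6.16), the sketch below reparametrizes the normal family from $(\text{mean}, \sigma)$ to $(\text{mean}, \sigma^2)$. It assumes the information matrix $\operatorname{diag}(1/\sigma^2, 2/\sigma^2)$ for $(\text{mean}, \sigma)$; the value of `sigma` and the helper names are illustrative assumptions, and the rows of `J` are indexed by the new parameters as in (6.13).

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

sigma = 1.7  # illustrative value

# Information for theta = (mean, sigma) in N(mean, sigma^2).
I_theta = [[1 / sigma**2, 0.0],
           [0.0, 2 / sigma**2]]

# New parameters (mean, tau) with tau = sigma^2, so theta = (mean, sqrt(tau)).
# J[i][j] = d theta_j / d new_i; here d sigma / d tau = 1 / (2 sigma).
J = [[1.0, 0.0],
     [0.0, 1 / (2 * sigma)]]

I_star = matmul(matmul(J, I_theta), transpose(J))
print(I_star)
```

The lower-right entry equals $1/(2\sigma^4)$, the familiar information about the variance in a single normal observation, while the entry for the mean is unchanged.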

In generalization of Theorem 5.4, let us now calculate 1(9) for multiparameter 
exponential families. 


Theorem 6.2 Let $X$ be distributed according to the exponential family (1.5.1) and
let

(6.17)   $\tau_i = E\,T_i(X), \quad i = 1, \ldots, s,$

the mean-value parametrization. Then,

(6.18)   $I(\tau) = C^{-1}$

where $C$ is the covariance matrix of $(T_1, \ldots, T_s)$.

Proof. It is easiest to work with the natural parametrization (1.5.2), which is equivalent.
By (6.10) and (1.5.15), the information in $X$ about the natural parameter $\eta$
is

$I^*(\eta) = \left\|\dfrac{\partial^2}{\partial\eta_j\partial\eta_k} A(\eta)\right\| = \|\operatorname{cov}(T_j, T_k)\| = C.$

Furthermore, (1.5.14) shows that $\tau_j = (\partial/\partial\eta_j)A(\eta)$ and, hence, (6.13) shows that

$J = \left\|\dfrac{\partial\tau_j}{\partial\eta_i}\right\| = C.$

Thus, from (6.16)

$C = I^*(\eta) = J I(\tau) J' = C I(\tau) C,$

which implies (6.18). □
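
Theorem 6.2 can be checked directly on a small case. The sketch below uses a single observation taking the values 0, 1, 2 with probabilities $p_0, p_1, p_2$ (a choice made here for illustration, not taken from the text), with $T_1, T_2$ the indicators of the outcomes 1 and 2, so that the mean-value parameters are $(p_1, p_2)$. The numerical values and function names are assumptions.

```python
p1, p2 = 0.2, 0.5            # illustrative probabilities
p0 = 1 - p1 - p2
outcomes = {0: p0, 1: p1, 2: p2}

def T(x):
    """Sufficient statistics: indicators of the outcomes 1 and 2."""
    return [1.0 if x == 1 else 0.0, 1.0 if x == 2 else 0.0]

def score(x):
    """Gradient of log p(x) with respect to (p1, p2), where p0 = 1 - p1 - p2."""
    if x == 1:
        return [1 / p1, 0.0]
    if x == 2:
        return [0.0, 1 / p2]
    return [-1 / p0, -1 / p0]

def expect(f):
    """Expectation of f(X) under the three-point distribution."""
    return sum(prob * f(x) for x, prob in outcomes.items())

tau = [expect(lambda x, i=i: T(x)[i]) for i in range(2)]
C = [[expect(lambda x, i=i, j=j: (T(x)[i] - tau[i]) * (T(x)[j] - tau[j]))
      for j in range(2)] for i in range(2)]
I = [[expect(lambda x, i=i, j=j: score(x)[i] * score(x)[j])
      for j in range(2)] for i in range(2)]

# If I(tau) = C^{-1}, the product I C is the identity matrix.
IC = [[sum(I[i][k] * C[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
print(IC)
```

The product of the information matrix and the covariance matrix is the identity up to rounding, in agreement with (6.18).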


Example 6.3 Multivariate normal information matrix. Let $(X_1, \ldots, X_p)$ have
a multivariate normal distribution with mean 0 and covariance matrix $\Sigma = \|\sigma_{ij}\|$,
so that by (1.4.15), the density is proportional to

$e^{-\sum\sum \eta_{ij} x_i x_j / 2}$

where $\|\eta_{ij}\| = \Sigma^{-1}$. Since $E(X_i X_j) = \sigma_{ij}$, we find that the information matrix of
the $\sigma_{ij}$ is

(6.19)   $I(\Sigma) = \Sigma^{-1}.$

Example 6.4 Exponential family information matrices. Table 6.1 gives $I(\theta)$ for
three two-parameter exponential families, where $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$ and $\psi'(\alpha) =
d\psi(\alpha)/d\alpha$ are, respectively, the digamma and trigamma function (Problem 6.5).


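Example 6.5 Information in location-scale families. For the location-scale families
with density $(1/\theta_2) f((x - \theta_1)/\theta_2)$, $\theta_2 > 0$, $f(x) > 0$ for all $x$, the elements
of the information matrix are (Problem 6.5)

(6.20)   $I_{11} = \dfrac{1}{\theta_2^2} \displaystyle\int \left[\dfrac{f'(y)}{f(y)}\right]^2 f(y)\,dy, \quad I_{22} = \dfrac{1}{\theta_2^2} \displaystyle\int \left[\dfrac{y f'(y)}{f(y)} + 1\right]^2 f(y)\,dy$

and

(6.21)   $I_{12} = \dfrac{1}{\theta_2^2} \displaystyle\int y \left[\dfrac{f'(y)}{f(y)}\right]^2 f(y)\,dy.$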
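Table 6.1. Three Information Matrices

$N(\xi, \sigma^2)$:   $I(\xi, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}$

$\Gamma(\alpha, \beta)$:   $I(\alpha, \beta) = \begin{pmatrix} \psi'(\alpha) & 1/\beta \\ 1/\beta & \alpha/\beta^2 \end{pmatrix}$

$B(\alpha, \beta)$:   $I(\alpha, \beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha + \beta) & -\psi'(\alpha + \beta) \\ -\psi'(\alpha + \beta) & \psi'(\beta) - \psi'(\alpha + \beta) \end{pmatrix}$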


The covariance term $I_{12}$ is zero whenever $f$ is symmetric about the origin.
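
The location-scale formulas can be evaluated numerically for a particular $f$. The sketch below takes $f$ to be the standard normal density, for which $f'(y)/f(y) = -y$, and uses a simple trapezoidal rule; the integration limits, step count, and the value of `theta2` are assumptions chosen for illustration.

```python
import math

def f(y):
    """Standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

def score(y):
    """f'(y)/f(y) for the standard normal density."""
    return -y

def integrate(g, lo=-12.0, hi=12.0, steps=48000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, steps):
        total += g(lo + k * h)
    return total * h

theta2 = 1.5  # illustrative scale parameter

I11 = integrate(lambda y: score(y) ** 2 * f(y)) / theta2 ** 2
I22 = integrate(lambda y: (y * score(y) + 1) ** 2 * f(y)) / theta2 ** 2
I12 = integrate(lambda y: y * score(y) ** 2 * f(y)) / theta2 ** 2
print(I11, I22, I12)
```

The results agree with the normal entry of Table 6.1 ($1/\sigma^2$, $2/\sigma^2$, and a zero off-diagonal term), with `theta2` playing the role of $\sigma$.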


Let us now generalize Theorems 5.10 and 5.15 to the multiparameter case in
which $\theta = (\theta_1, \ldots, \theta_s)$. For convenience, we state the generalizations in one
theorem.


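Theorem 6.6 (Multiparameter Information Inequality) Suppose that (6.6) holds,
and $I(\theta)$ is positive definite. Let $\delta$ be any statistic for which $E_\theta(|\delta|^2) < \infty$ and
either

(i) For $i = 1, \ldots, s$, $(\partial/\partial\theta_i)E_\theta\delta$ exists and can be obtained by differentiating
under the integral sign,

or

(ii) There exist functions $b_\theta^{(i)}$, $i = 1, \ldots, s$, with $E_\theta[b_\theta^{(i)}(X)]^2 < \infty$ that satisfy

$\left|\dfrac{p_{\theta + \Delta\varepsilon_i}(x) - p_\theta(x)}{\Delta\, p_\theta(x)}\right| \le b_\theta^{(i)}(x)$ for all $\Delta$,

where $\varepsilon_i \in R^s$ is the unit vector with 1 in the $i$th position and zero elsewhere.
Then, $E_\theta[(\partial/\partial\theta_i) \log p_\theta(X)] = 0$ and

(6.22)   $\operatorname{var}_\theta(\delta) \ge \alpha' I^{-1}(\theta) \alpha$

where $\alpha'$ is the row matrix with $i$th element

(6.23)   $\alpha_i = \dfrac{\partial}{\partial\theta_i} E_\theta[\delta(X)].$

Proof. If the functions $\psi_i$ of Theorem 6.1 are taken to be $\psi_i =
(\partial/\partial\theta_i) \log p_\theta(X)$, (6.22) follows from (6.1) and (6.10). □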


If $\delta$ is an estimator of $g(\theta)$ and $b(\theta)$ is its bias, then (6.23) reduces to

(6.24)   $\alpha_i = \dfrac{\partial}{\partial\theta_i}[b(\theta) + g(\theta)].$

It is interesting to compare the lower bound (6.22) with the corresponding bound
when the $\theta$'s other than $\theta_i$ are known. By Theorem 5.15, the latter is equal to
$[(\partial/\partial\theta_i)E_\theta(\delta)]^2/I_{ii}(\theta)$. This is the bound obtained by setting $a = \varepsilon_i$ in (6.4),
where $\varepsilon_i$ is the $i$th unit vector. For example, if the $\theta$'s other than $\theta_i$ are zero, then
the only nonzero element of the vector $\alpha$ of (6.22) is $\alpha_i$. Since (6.22) was obtained
by maximizing (6.4), comparing the two bounds shows

(6.25)   $I_{ii}^{-1}(\theta) \le \|I^{-1}(\theta)\|_{ii}.$

(See Problem 6.10 for a different derivation.) The two sides of (6.25) are equal if

(6.26)   $I_{ij}(\theta) = 0$ for all $j \ne i$,

as is seen from the definition of the inverse of a matrix, and, in fact, (6.26) is also
necessary for equality in (6.25) (Problem 6.10). In this situation, when (6.26) holds,
the parameters are said to be orthogonal. This is illustrated by the first matrix in
Table 6.1. There, the information bound for one of the parameters is independent
of whether the other parameter is known. This is not the case, however, in the
second and third situations in Table 6.1, where the value of one parameter affects
the information for another. Some implications of these results for estimation will
be taken up in Section 6.6. (Cox and Reid (1987) discuss methods for obtaining
parameter orthogonality, and some of its consequences; see also Barndorff-Nielsen
and Cox 1994.)
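
The comparison in (6.25) can be made concrete with two of the matrices in Table 6.1. The sketch below is illustrative only: it assumes the gamma information matrix with entries $\psi'(\alpha)$, $1/\beta$, $\alpha/\beta^2$, uses $\alpha = 1$ so that the trigamma value is the known constant $\psi'(1) = \pi^2/6$, and picks arbitrary values for $\beta$ and $\sigma$.

```python
import math

def bounds(I):
    """Return (1/I_ii, [I^{-1}]_ii) for i = 1, 2 of a 2x2 information matrix."""
    det = I[0][0] * I[1][1] - I[0][1] * I[1][0]
    other_known = [1 / I[0][0], 1 / I[1][1]]
    other_unknown = [I[1][1] / det, I[0][0] / det]
    return other_known, other_unknown

# Gamma(alpha, beta) with alpha = 1: trigamma(1) = pi^2 / 6.
alpha, beta = 1.0, 2.0
I_gamma = [[math.pi ** 2 / 6, 1 / beta],
           [1 / beta, alpha / beta ** 2]]

# Normal with parameters (mean, sigma): off-diagonal terms vanish.
sigma = 2.0
I_normal = [[1 / sigma ** 2, 0.0],
            [0.0, 2 / sigma ** 2]]

gamma_known, gamma_unknown = bounds(I_gamma)
normal_known, normal_unknown = bounds(I_normal)
print(gamma_known, gamma_unknown)
print(normal_known, normal_unknown)
```

For the gamma matrix the bound with the other parameter unknown is strictly larger in both coordinates, whereas for the normal matrix the two bounds coincide, as expected for orthogonal parameters.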

In a manner analogous to the one-parameter case, it can be shown that the
information inequality bound is attained only if $\delta(x)$ has the form

(6.27)   $\delta(x) = g(\theta) + [\nabla g(\theta)]' I(\theta)^{-1} [\nabla \log p_\theta(x)],$

where $E\delta = g(\theta)$, $\nabla g(\theta) = \{(\partial/\partial\theta_i)g(\theta),\ i = 1, 2, \ldots, s\}$, $\nabla \log p_\theta(x) = \{(\partial/\partial\theta_i)
\log p_\theta(x),\ i = 1, 2, \ldots, s\}$. It is also the case, analogous to Theorem 5.12, that
if the bound is attainable then the underlying family of distributions constitutes
an exponential family (Joshi 1976, Fabian and Hannan 1977, Müller-Funk et al.
1989).

The information inequalities (5.31) and (6.22) have been extended in a number 
of directions, some of which are briefly sketched in the following. 

(a) When the lower bound is not sharp, it can usually be improved by considering
not only the derivatives $\psi_i$ but also higher derivatives:

(6.28)   $\psi_{i_1, \ldots, i_s} = \dfrac{1}{p_\theta(x)} \dfrac{\partial^{i_1 + \cdots + i_s} p_\theta(x)}{\partial\theta_1^{i_1} \cdots \partial\theta_s^{i_s}}.$

It is then easy to generalize (5.31) and (5.24) to obtain a lower bound based on
any given set $S$ of the $\psi$'s. Assume (6.6) with (c) replaced by the corresponding
assumption for all the derivatives needed for the set $S$, and suppose that the
covariance matrix $K(\theta)$ of the given set of $\psi$'s is positive definite. Then, (6.1)
yields the Bhattacharyya inequality

(6.29)   $\operatorname{var}_\theta(\delta) \ge \alpha' K^{-1}(\theta) \alpha$

where $\alpha'$ is the row matrix with elements

(6.30)   $\dfrac{\partial^{i_1 + \cdots + i_s}}{\partial\theta_1^{i_1} \cdots \partial\theta_s^{i_s}} E_\theta \delta(X) = \operatorname{cov}(\delta, \psi_{i_1, \ldots, i_s}).$

It is also seen that equality holds in (6.29) if and only if $\delta$ is a linear function
of the $\psi$'s in $S$ (Problem 6.12). The problem of whether the Bhattacharyya
bounds become sharp as $s \to \infty$ has been investigated for some
one-parameter cases by Blight and Rao (1974).

(b) A different kind of extension avoids the need for regularity conditions by con¬ 
sidering differences instead of derivatives. (See Hammersley 1950, Chapman 
and Robbins 1951, Kiefer 1952, Fraser and Guttman 1952, Fend 1959, Sen 
and Ghosh 1976, Chatterji 1982, and Klaassen 1984, 1985.) 

(c) Applications of the inequality to the sequential case in which the number of
observations is not a fixed integer but a random variable, say $N$, determined
from the observations are provided by Wolfowitz (1947), Blackwell and Girshick
(1947), and Seth (1949). Under suitable regularity conditions, (6.23)
then continues to hold with $n$ replaced by $E_\theta(N)$; see also Simons 1980,
Govindarajulu and Vincze 1989, and Stefanov 1990.

(d) Other extensions include arbitrary convex loss functions (Kozek 1976);
weighted loss functions (Mikulski and Monsour 1988); the case that $g$
and $\delta$ are vector-valued (Rao 1945, Cramér 1946b, Seth 1949, Shemyakin
1987, and Rao 1992); nonparametric problems (Vincze 1992); location
problems (Klaassen 1984); and density estimation (Brown and Farrell 1990).

7 Problems 

Section 1 

1.1 Verify (a) that (1.4) defines a probability distribution and (b) condition (1.5). 

1.2 InExample 1.5, show that a* minimizes (1.6) for i =0, 1, and simplify the expression 
for Aq. [Hint: E/rp* -1 and Y,k(k — l)p*~ 2 are the first and second derivatives of Ep* = 
!/<?•] 

1.3 Let $X$ take on the values $-1, 0, 1, 2, 3$ with probabilities $P(X = -1) = 2pq$ and
$P(X = k) = p^k q^{3-k}$ for $k = 0, 1, 2, 3$.

(a) Check that this is a probability distribution.

(b) Determine the LMVU estimator at $p_0$ of (i) $p$, and (ii) $pq$, and decide for each
whether it is UMVU. 

1.4 For a sample of size $n$, suppose that the estimator $T(x)$ of $\tau(\theta)$ has expectation

$E[T(X)] = \tau(\theta) + \sum_{k=1}^{\infty} \dfrac{a_k}{n^k},$

where $a_k$ may depend on $\theta$ but not on $n$.

(a) Show that the expectation of the jackknife estimator $T_J$ of (1.3) is

$E[T_J(X)] = \tau(\theta) - \dfrac{a_2}{n^2} + O(1/n^3).$

(b) Show that if $\operatorname{var} T \sim c/n$ for some constant $c$, then $\operatorname{var} T_J \sim c'/n$ for some
constant $c'$. Thus, the jackknife will reduce bias and not increase variance.

A second-order jackknife can be defined by jackknifing $T_J$, and this will result in further
bias reduction, but may not maintain a variance of the same order (Robson and Whitlock
1964; see also Thorburn 1976 and Note 8.3).
1.5 (a) Any two random variables $X$ and $Y$ with finite second moments satisfy the
covariance inequality $[\operatorname{cov}(X, Y)]^2 \le \operatorname{var}(X) \cdot \operatorname{var}(Y)$.

(b) The inequality in part (a) is an equality if and only if there exist constants $a$ and $b$
for which $P(X = aY + b) = 1$.

[Hint: Part (a) follows from the Schwarz inequality (Problem 1.7.20) with $f = X - E(X)$
and $g = Y - E(Y)$.]

1.6 An alternative proof of the Schwarz inequality is obtained by noting that

$\int (f + \lambda g)^2\,dP = \int f^2\,dP + 2\lambda \int fg\,dP + \lambda^2 \int g^2\,dP \ge 0$ for all $\lambda$,

so that this quadratic in $\lambda$ has at most one root.

1.7 Suppose $X$ is distributed on $(0, 1)$ with probability density $p_\theta(x) = (1 - \theta) + \theta/(2\sqrt{x})$
for all $0 < x < 1$, $0 \le \theta \le 1$. Show that there does not exist an LMVU estimator of $\theta$.
[Hint: Let $\delta(x) = a[x^{-1/2} + b]$ for $c < x < 1$ and $\delta(x) = 0$ for $0 < x < c$. There exist
values $a$ and $b$, and $c$ such that $E_0(\delta) = 0$ and $E_1(\delta) = 1$ (and $\delta$ is unbiased) and that
$E_0(\delta^2)$ is arbitrarily close to zero (Stein 1950).]

1.8 If $\delta$ and $\delta'$ have finite variance, so does $\delta' - \delta$. [Hint: Problem 1.5.]

1.9 In Example 1.9, (a) determine all unbiased estimators of zero; (b) show that no 
nonconstant estimator is UMVU. 

1.10 If estimators are restricted to the class of linear estimators, characterization of best 
unbiased estimators is somewhat easier. Although the following is a consequence of 
Theorem 1.7, it should be established without using that theorem. 

Let $X_{p \times 1}$ satisfy $E(X) = B\psi$ and $\operatorname{var}(X) = I$, where $B_{p \times r}$ is known, and $\psi_{r \times 1}$ is
unknown. A linear estimator is an estimator of the form $a'X$, where $a_{p \times 1}$ is a known
vector. We are concerned with the class of estimators

$\mathcal{D} = \{\delta(x) : \delta(x) = a'x, \text{ for some known vector } a\}.$

(a) For a known vector $c$, show that the estimators in $\mathcal{D}$ that are unbiased estimators
of $c'\psi$ satisfy $a'B = c'$.

(b) Let $\mathcal{D}_c = \{\delta(x) : \delta(x) = a'x,\ a'B = c'\}$ be the class of linear unbiased estimators of
$c'\psi$. Show that the best linear unbiased estimator (BLUE) of $c'\psi$, the linear unbiased
estimator with minimum variance, is $\delta^*(x) = a^{*\prime}x$, where $a^{*\prime} = a'B(B'B)^{-1}B'$
and $a^{*\prime}B = c'$ with variance $\operatorname{var}(\delta^*) = c'c$.

(c) Let $\mathcal{D}_0 = \{\delta(x) : \delta(x) = a'x,\ a'B = 0\}$ be the class of linear unbiased estimators of
zero. Show that if $\delta \in \mathcal{D}_0$, then $\operatorname{cov}(\delta, \delta^*) = 0$.

(d) Hence, establish the analog of Theorem 1.7 for linear estimators:

Theorem. An estimator $\delta^* \in \mathcal{D}_c$ satisfies $\operatorname{var}(\delta^*) = \min_{\delta \in \mathcal{D}_c} \operatorname{var}(\delta)$ if and only if
$\operatorname{cov}(\delta^*, U) = 0$, where $U$ is any estimator in $\mathcal{D}_0$.

(e) Show that the results here can be directly extended to the case of $\operatorname{var}(X) = \Sigma$, where
$\Sigma_{p \times p}$ is a known matrix, by considering the transformed problem with $X^* = \Sigma^{1/2}X$
and $B^* = \Sigma^{1/2}B$.


1.11 Use Theorem 1.7 to find UMVU estimators of some of the $\eta_\theta(d_i)$ in the dose-response
model (1.6.16), with the restriction (1.6.17) (Messig and Strawderman 1993).
Let the classes $\Delta$ and $\mathcal{U}$ be defined as in Theorem 1.7.

(a) Show that an estimator $U \in \mathcal{U}$ if and only if $U(x_1, x_2) = a[I(x_1 = 0) - I(x_2 = 0)]$
for an arbitrary constant $a < \infty$.
(b) Using part (a) and (1.7), show that an estimator $\delta$ is UMVU for its expectation only

if it is of the form S(x ,, x 2 ) = al (0 , o)C*i, * 2 ) +W(o,i),(i,o),( 2 ,o)C*i ,x 2 ) + c/ ( i, p(.ri , x 2 ) + 
dl( 2 ,i)(x\, x 2 ) where a, b , c, and d are arbitrary constants. 

(c) Show that there does not exist a UMVU estimator of qg(di) = 1 — e ~ e , but the 
UMVU estimator of r)g(d 2 ) = 1 — e _M is 5 (jci, x 2 ) = 1 — \[I(x 1 = 0) + I(x 2 = 0)]. 

(d) Show that the LMVU estimator of 1 — e~ e is S(xi,x 2 ) = y •+ 2(1+ ^_ g) [/(jc 1 = 
0) — I (x 2 = 0)]. 

1.12 Show that if $\delta(X)$ is a UMVU estimator of $g(\theta)$, it is the unique UMVU estimator
of $g(\theta)$. (Do not assume completeness, but rather use the covariance inequality and the
conditions under which it is an equality.)

1.13 If $\delta_1$ and $\delta_2$ are in $\Delta$ and are UMVU estimators of $g_1(\theta)$ and $g_2(\theta)$, respectively,
then $a_1\delta_1 + a_2\delta_2$ is also in $\Delta$ and is UMVU for estimating $a_1 g_1(\theta) + a_2 g_2(\theta)$, for any real
$a_1$ and $a_2$.

1.14 Completeness of $T$ is not only sufficient but also necessary so that every $g(\theta)$ that
can be estimated unbiasedly has only one unbiased estimator that is a function of $T$.

1.15 Suppose $X_1, \ldots, X_n$ are iid Poisson($\lambda$).

(a) Show that $\bar{X}$ is the UMVU estimator for $\lambda$.

(b) For $S^2 = \sum_{i=1}^n (X_i - \bar{X})^2/(n - 1)$, we have that $ES^2 = E\bar{X} = \lambda$. To directly establish
that $\operatorname{var} S^2 > \operatorname{var} \bar{X}$, prove that $E(S^2 | \bar{X}) = \bar{X}$.

Note: The identity $E(S^2 | \bar{X}) = \bar{X}$ shows how completeness can be used in calculating
conditional expectations.

1.16 (a) If $X_1, \ldots, X_n$ are iid (not necessarily normal) with $\operatorname{var}(X_i) = \sigma^2 < \infty$, show
that $\delta = \sum (X_i - \bar{X})^2/(n - 1)$ is an unbiased estimator of $\sigma^2$.

(b) If the $X_i$ take on the values 1 and 0 with probabilities $p$ and $q = 1 - p$, the estimator
$\delta$ of (a) depends only on $T = \sum X_i$ and hence is UMVU for estimating $\sigma^2 = pq$.
Compare this result with that of Example 1.13.

1.17 If $T$ has the binomial distribution $b(p, n)$ with $n \ge 3$, use Method 1 to find the
UMVU estimator of $p^3$.

1.18 Let $X_1, \ldots, X_n$ be iid according to the Poisson distribution $P(\lambda)$. Use Method 1 to
find the UMVU estimator of (a) $\lambda^k$ for any positive integer $k$ and (b) $e^{-\lambda}$.

1.19 Let $X_1, \ldots, X_n$ be distributed as in Example 1.14. Use Method 1 to find the UMVU
estimator of $\theta^k$ for any integer $k > -n$.

1.20 Solve Problem 1.18(b) by Method 2, using the fact that an unbiased estimator of
$e^{-\lambda}$ is $\delta = 1$ if $X_1 = 0$, and $\delta = 0$ otherwise.

1.21 In $n$ Bernoulli trials, let $X_i = 1$ or 0 as the $i$th trial is a success or failure, and let
$T = \sum X_i$. Solve Problem 1.17 by Method 2, using the fact that an unbiased estimator
of $p^3$ is $\delta = 1$ if $X_1 = X_2 = X_3 = 1$, and $\delta = 0$ otherwise.

1.22 Let $X$ take on the values 1 and 0 with probability $p$ and $q$, respectively, and assume
that $1/4 < p < 3/4$. Consider the problem of estimating $p$ with loss function $L(p, d) =
1$ if $|d - p| > 1/4$, and 0 otherwise. Let $\delta^*$ be the randomized estimator which is $Y_0$ or
$Y_1$ when $X = 0$ or 1, where $Y_0$ and $Y_1$ are distributed as $U(-1/2, 1/2)$ and $U(1/2, 3/2)$,
respectively.

(a) Show that $\delta^*$ is unbiased.

(b) Compare the risk function of $\delta^*$ with that of $X$.
Section 2

2.1 If $X_1, \ldots, X_n$ are iid as $N(\xi, \sigma^2)$ with $\sigma^2$ known, find the UMVU estimator of (a)
$\xi^2$, (b) $\xi^3$, and (c) $\xi^4$. [Hint: To evaluate the expectation of $\bar{X}^k$, write $\bar{X} = Y + \xi$, where
$Y$ is $N(0, \sigma^2/n)$ and expand $E(Y + \xi)^k$.]

2.2 Solve the preceding problem when $\sigma$ is unknown.

2.3 In Example 2.1 with $\sigma$ known, let $\delta = \sum c_i X_i$ be any linear estimator of $\xi$. If $\delta$ is
biased, show that its risk $E(\delta - \xi)^2$ is unbounded. [Hint: If $\sum c_i = 1 + k$, the risk is
$\ge k^2\xi^2$.]

2.4 Suppose, as in Example 2.1, that $X_1, \ldots, X_n$ are iid as $N(\xi, \sigma^2)$, with one of the
parameters known, and that the estimand is a polynomial in $\xi$ or $\sigma$. Then, the UMVU
estimator is a polynomial in $\bar{X}$ or $S^2 = \sum (X_i - \xi)^2$. The variance of any such polynomial
can be estimated if one knows the moments $E(\bar{X}^k)$ and $E(S^k)$ for all $k = 1, 2, \ldots$. To
determine $E(\bar{X}^k)$, write $\bar{X} = Y + \xi$, where $Y$ is distributed as $N(0, \sigma^2/n)$. Show that

(a) $E(\bar{X}^k) = \sum_{r=0}^{k} \binom{k}{r} \xi^{k-r} E(Y^r)$

with

$E(Y^r) = (r - 1)(r - 3) \cdots 3 \cdot 1 \cdot (\sigma^2/n)^{r/2}$ when $r \ge 2$ is even, and $E(Y^r) = 0$ when $r$ is odd.

(b) As an example, consider the UMVU estimator $S^2/n$ of $\sigma^2$. Show that $E(S^4) =
n(n + 2)\sigma^4$ and $\operatorname{var}(S^2/n) = 2\sigma^4/n$, and that the UMVU estimator of this variance is
$2S^4/n^2(n + 2)$.

2.5 In Example 2.1, when both parameters are unknown, show that the UMVU estimator
of $\xi^2$ is given by $\delta = \bar{X}^2 - \dfrac{S^2}{n(n-1)}$, where now $S^2 = \sum (X_i - \bar{X})^2$.

2.6 (a) Determine the variance of the estimator of Problem 2.5.

(b) Find the UMVU estimator of the variance in part (a).

2.7 If $X$ is a single observation from $N(\xi, \sigma^2)$, show that no unbiased estimator $\delta$ of
$\sigma^2$ exists when $\xi$ is unknown. [Hint: For fixed $\sigma = a$, $X$ is a complete sufficient statistic
for $\xi$, and $E[\delta(X)] = a^2$ for all $\xi$ implies $\delta(x) = a^2$ a.e.]

2.8 Let $X_i$, $i = 1, \ldots, n$, be independently distributed as $N(\alpha + \beta t_i, \sigma^2)$ where $\alpha$, $\beta$, and
$\sigma^2$ are unknown, and the $t$'s are known constants that are not all equal. Find the UMVU
estimators of $\alpha$ and $\beta$.

2.9 In Example 2.2 with $n = 1$, the UMVU estimator of $p$ is the indicator of the event
$X_1 \le u$ whether $\sigma$ is known or unknown.

2.10 Verify Equation (2.14), the density of $(X_1 - \bar{X})/S$ in normal sampling. [The UMVU
estimator in (2.13) is used by Kiefer (1977) as an example of his estimated confidence
approach.]

2.11 Assuming (2.15) with $\sigma = \tau$, determine the UMVU estimators of $\sigma^2$ and $(\eta - \xi)/\sigma$.

2.12 Assuming (2.15) with $\eta = \xi$ and $\sigma^2/\tau^2 = \gamma$, show that when $\gamma$ is known:


(a) $T'$ defined in Example 2.3(iii) is a complete sufficient statistic;

(b) $\delta_\gamma$ is UMVU for $\xi$.

2.13 Show that in the preceding problem with $\gamma$ unknown,

(a) a UMVU estimator of $\xi$ does not exist;

(b) the estimator $\hat\xi$ is unbiased under the conditions stated in Example 2.3. [Hint: (i)
Problem 2.12(b) and the fact that $\delta_\gamma$ is unbiased for $\xi$ even when $\sigma^2/\tau^2 \ne \gamma$. (ii)
Condition on $(S_X, S_Y)$.]

2.14 For the model (2.15) find the UMVU estimator of $P(X_1 < Y_1)$ when (a) $\sigma = \tau$ and
(b) when $\sigma$ and $\tau$ are arbitrary. [Hint: Use the conditional density (2.13) of $X_1$ given
$\bar{X}$, $S_X^2$ and that of $Y_1$ given $\bar{Y}$, $S_Y^2$ to determine the conditional density of $Y_1 - X_1$ given
$\bar{X}$, $\bar{Y}$, $S_X^2$, and $S_Y^2$.]

2.15 If $(X_1, Y_1), \ldots, (X_n, Y_n)$ are iid according to any bivariate distribution with finite
second moments, show that $S_{XY}/(n - 1)$ given by (2.17) is an unbiased estimator of
$\operatorname{cov}(X_i, Y_i)$.

2.16 In a sample of size $N = n + k + l$, some of the observations are missing. Assume that
$(X_i, Y_i)$, $i = 1, \ldots, n$, are iid according to the bivariate normal distribution (2.16), and
that $U_1, \ldots, U_k$ and $V_1, \ldots, V_l$ are independent $N(\xi, \sigma^2)$ and $N(\eta, \tau^2)$, respectively.

(a) Show that the minimal sufficient statistics are complete when $\xi$ and $\eta$ are known
but not when they are unknown.

(b) When $\xi$ and $\eta$ are known, find the UMVU estimators for $\sigma^2$, $\tau^2$, and $\rho\sigma\tau$, and
suggest reasonable unbiased estimators for these parameters when $\xi$ and $\eta$ are
unknown.

2.17 For the family (2.22), show that the UMVU estimator of $a$ when $b$ is known and the
UMVU estimator of $b$ when $a$ is known are as stated in Example 2.5. [Hint: Problem 6.18.]

2.18 Show that the estimators (2.23) are UMVU. [Hint: Problem 1.6.18.]. 

2.19 For the family (2.22) with $b = 1$, find the UMVU estimator of $P(X_1 > u)$ and
of the density $e^{-(u-a)}$ of $X_1$ at $u$. [Hint: Obtain the estimator $\delta(X_{(1)})$ of the density by
applying Method 2 of Section 2.1 and then the estimator of the probability by integration.
Alternatively, one can first obtain the estimator of the probability as $P(X_1 > u | X_{(1)})$
using the fact that $X_1 - X_{(1)}$ is ancillary and that given $X_{(1)}$, $X_1$ is either equal to $X_{(1)}$ or
distributed as $E(X_{(1)}, 1)$.]

2.20 Find the UMVU estimator of $P(X_1 > u)$ for the family (2.22) when both $a$ and $b$
are unknown.

2.21 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independently distributed as $E(a, b)$ and $E(a', b')$,
respectively.

(a) If $a$, $b$, $a'$, and $b'$ are completely unknown, $X_{(1)}$, $Y_{(1)}$, $\sum [X_i - X_{(1)}]$, and $\sum [Y_j - Y_{(1)}]$
jointly are sufficient and complete.

(b) Find the UMVU estimators of $a' - a$ and $b'/b$.

2.22 In the preceding problem, suppose that $b' = b$.

(a) Show that $X_{(1)}$, $Y_{(1)}$, and $\sum [X_i - X_{(1)}] + \sum [Y_j - Y_{(1)}]$ are sufficient and complete.

(b) Find the UMVU estimators of $b$ and $(a' - a)/b$.

2.23 In Problem 2.21, suppose that a' = a. 

(a) Show that the complete sufficient statistic of Problem 2.21(a) is still minimal suf¬ 
ficient but no longer complete. 

(b) Show that a UMVU estimator for a' = a does not exist. 

(c) Suggest a reasonable unbiased estimator for a' = a. 
2.24 Let $X_1, \ldots, X_n$ be iid according to the uniform distribution $U(\xi - b, \xi + b)$. If $\xi$ and $b$
are both unknown, find the UMVU estimators of $\xi$, $b$, and $\xi/b$. [Hint: Problem 1.6.30.]

2.25 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be iid as $U(0, \theta)$ and $U(0, \theta')$, respectively. If $n > 1$,
determine the UMVU estimator of $\theta/\theta'$.

2.26 Verify the ML estimators given in (2.24). 

2.27 In Example 2.6(b), show that

(a) The bias of the ML estimator is 0 when $\xi = u$.

(b) At $\xi = u$, the ML estimator has smaller expected squared error than the UMVU
estimator.

[Hint: In (b), note that $u - \bar{X}$ is always closer to 0 than

2.28 Verify (2.26).

2.29 Under the assumptions of Lemma 2.7, show that:

(a) If $b$ is replaced by any random variable $B$ which is independent of $X$ and not 0
with probability 1, then $R_\delta(\theta) < R_{\delta^*}(\theta)$.

(b) If squared error is replaced by any loss function of the form $L(\theta, d) = \rho(d - \theta)$ and
$\delta$ is risk unbiased with respect to $L$, then $R_\delta(\theta) < R_{\delta^*}(\theta)$.


Section 3 

3.1 (a) In Example 3.1, show that $\sum (X_i - \bar{X})^2 = T(n - T)/n$.

(b) The variance of $T(n - T)/n(n - 1)$ in Example 3.1 is $(pq/n)[(q - p)^2 + 2pq/(n - 1)]$.

3.2 If $T$ is distributed as $b(p, n)$, find an unbiased estimator $\delta(T)$ of $p^m$ ($m \le n$) by
Method 1, that is, using (1.10). [Hint: Example 1.13.]

3.3 (a) Use the method leading to (3.2) to find the UMVU estimator $\pi_k(T)$ of
$P[X_1 + \cdots + X_m = k] = \binom{m}{k} p^k q^{m-k}$ ($m \le n$).

(b) For fixed $t$ and varying $k$, show that the $\pi_k(t)$ are the probabilities of a hypergeometric
distribution.

3.4 If $Y$ is distributed according to (3.3), use Method 1 of Section 2.1

(a) to show that the UMVU estimator of $p^r$ ($r < m$) is

$\delta(y) = \dfrac{(m - r + y - 1)(m - r + y - 2) \cdots (m - r)}{(m + y - 1)(m + y - 2) \cdots m},$

and hence in particular that the UMVU estimator of $1/p$, $1/p^2$ and $p$ are, respectively,
$(m + y)/m$, $(m + y)(m + y + 1)/m(m + 1)$, and $(m - 1)/(m + y - 1)$;

(b) to determine the UMVU estimator of $\operatorname{var}(Y)$;

(c) to show how to calculate the UMVU estimator $\delta$ of $\log p$.

3.5 Consider the scheme in which binomial sampling is continued until at least $a$ successes
and $b$ failures have been obtained. Show how to calculate a reasonable estimator of
$\log(p/q)$. [Hint: To obtain an unbiased estimator of $\log p$, modify the UMVU estimator
$\delta$ of Problem 3.4(c).]

3.6 If binomial sampling is continued until $m$ successes have been obtained, let $X_i$ ($i =
1, \ldots, m$) be the number of failures between the $(i - 1)$st and $i$th success.

(a) The $X_i$ are iid according to the geometric distribution $P(X_i = x) = pq^x$, $x =
0, 1, \ldots$.

(b) The statistic $Y = \sum X_i$ is sufficient for $(X_1, \ldots, X_m)$ and has the distribution (3.3).

3.7 Suppose that binomial sampling is continued until the number of successes equals
the number of failures.

(a) This rule is closed if $p = 1/2$ but not otherwise.

(b) If $p = 1/2$ and $N$ denotes the number of trials required, $E(N) = \infty$.

3.8 Verify Equation (3.7) with the appropriate definition of $N'(x, y)$ (a) for the estimation
of $p$ and (b) for the estimation of $p^a q^b$.

3.9 Consider sequential binomial sampling with the stopping points $(0, 1)$ and $(2, y)$,
$y = 0, 1, \ldots$. (a) Show that this plan is closed and simple. (b) Show that $(X, Y)$ is not
complete by finding a nontrivial unbiased estimator of zero.

3.10 In Example 3.4(ii), (a) show that the plan is closed but not simple, (b) show that
$(X, Y)$ is not complete, and (c) evaluate the unbiased estimator (3.7) of $p$.

3.11 Curtailed single sampling. Let $a, b < n$ be three non-negative integers. Continue
observation until either $a$ successes, $b$ failures, or $n$ observations have been obtained.
Determine the UMVU estimator of $p$.

3.12 For any sequential binomial sampling plan, the coordinates $(X, Y)$ of the end point
of the sample path are minimal sufficient.

3.13 Consider any closed sequential binomial sampling plan with a set $B$ of stopping
points, and let $B'$ be the set $B \cup \{(x_0, y_0)\}$ where $(x_0, y_0)$ is a point not in $B$ that has positive
probability of being reached under plan $B$. Show that the sufficient statistic $T = (X, Y)$ is
not complete for the sampling plan which has $B'$ as its set of stopping points. [Hint: For
any point $(x, y) \in B$, let $N(x, y)$ and $N'(x, y)$ denote the number of paths to $(x, y)$ when
the set of stopping points is $B$ and $B'$, respectively, and let $N(x_0, y_0) = 0$, $N'(x_0, y_0) = 1$.
Then, the statistic $1 - [N(X, Y)/N'(X, Y)]$ has expectation 0 under $B'$ for all values of
$p$.]

3.14 For any sequential binomial sampling plan under which the point $(1, 1)$ is reached
with positive probability but is not a stopping point, find an unbiased estimator of $pq$
depending only on $(X, Y)$. Evaluate this estimator for

(a) taking a sample of fixed size $n > 2$;

(b) inverse binomial sampling.

3.15 Use (3.3) to determine $A(t, n)$ in (3.11) for the negative binomial distribution with
$m = n$, and evaluate the estimators (3.13) of $q^r$, and (3.14).

3.16 Consider $n$ binomial trials with success probability $p$, and let $r$ and $s$ be two positive
integers with $r + s < n$. To the boundary $x + y = n$, add the boundary point $(r, s)$, that
is, if the number of successes in the first $r + s$ trials is exactly $r$, the process is stopped
and the remaining $n - (r + s)$ trials are not performed.

(a) Show that $U$ is an unbiased estimator of zero if and only if $U(k, n - k) = 0$ for
$k = 0, 1, \ldots, r - 1$ and $k = n - s + 1, n - s + 2, \ldots, n$, and $U(k, n - k) = c_k U(r, s)$
for $k = r, \ldots, n - s$, where the $c$'s are given constants $\ne 0$.

(b) Show that $\delta$ is the UMVU estimator of its expectation if and only if

$\delta(k, n - k) = \delta(r, s)$ for $k = r, \ldots, n - s$.

3.17 Generalize the preceding problem to the case that two points $(r_1, s_1)$ and $(r_2, s_2)$
with $r_i + s_i < n$ are added to the boundary. Assume that these two points are such that
all $n + 1$ points $x + y = n$ remain boundary points. [Hint: Distinguish the three cases
that the intervals $(r_1, s_1)$ and $(r_2, s_2)$ are (i) mutually exclusive, (ii) one contained in the
other, and (iii) overlapping but neither contained in the other.]





3.18 If $X$ has the Poisson distribution $P(\theta)$, show that $1/\theta$ does not have an unbiased
estimator.


3.19 If $X_1, \ldots, X_n$ are iid according to (3.18), the Poisson distribution truncated on the
left at 0, find the UMVU estimator of $\theta$ when (a) $n = 1$ and (b) $n = 2$.

3.20 Let $X_1, \ldots, X_n$ be a sample from the Poisson distribution truncated on the left at 0,
i.e., with sample space $\mathcal{X} = \{1, 2, 3, \ldots\}$.

(a) For $t = \sum x_i$, the UMVU estimator of $\lambda$ is (Tate and Goen 1958) $\hat\lambda = t\,C_{t-1}^n / C_t^n$, where
$C_t^n = \dfrac{(-1)^n}{n!} \sum_{k=0}^{n} \binom{n}{k} (-1)^k k^t$ is a Stirling number of the second kind.

(b) An alternate form of the UMVU estimator is $\hat\lambda = \dfrac{t}{n}\left(1 - \dfrac{C_{t-1}^{n-1}}{C_t^n}\right)$. [Hint: Establish
the identity $C_t^n = C_{t-1}^{n-1} + nC_{t-1}^n$.]

(c) The Cramér-Rao lower bound for the variance of unbiased estimators of $\lambda$ is
$\lambda(1 - e^{-\lambda})^2/[n(1 - e^{-\lambda} - \lambda e^{-\lambda})]$, and it is not attained by the UMVU estimator. (It is,
however, the asymptotic variance of the ML estimator.)


3.21 Suppose that $X$ has the Poisson distribution truncated on the right at $a$, so that it
has the conditional distribution of $Y$ given $Y \le a$, where $Y$ is distributed as $P(\lambda)$. Show
that $\lambda$ does not have an unbiased estimator.

3.22 For the negative binomial distribution truncated at zero, evaluate the estimators 
(3.13) and (3.14) for m = 1, 2, and 3. 

3.23 If $X_1, \ldots, X_n$ are iid $P(\lambda)$, consider estimation of $e^{-b\lambda}$, where $b$ is known.

(a) Show that $\delta^* = (1 - b/n)^t$ is the UMVU estimator of $e^{-b\lambda}$.

(b) For $b > n$, describe the behavior of $\delta^*$, and suggest why it might not be a reasonable
estimator.


(The probability $e^{-b\lambda}$, for $b > n$, is that of an "unobservable" event, in that it can be
interpreted as the probability of no occurrence in a time interval of length $b$. A number
of such situations are described and analyzed in Lehmann (1983), where it is suggested
that, in these problems, no reasonable estimator may exist.)

3.24 If $X_1, \ldots, X_n$ are iid according to the logarithmic series distribution of Problem
1.5.14, evaluate the estimators (3.13) and (3.14) for $n = 1$, 2, and 3.

3.25 For the multinomial distribution of Example 3.8,

(a) show that $p_0^{r_0} \cdots p_s^{r_s}$ has an unbiased estimator provided $r_0, \ldots, r_s$ are nonnegative
integers with $\sum r_i \le n$;

(b) find the totality of functions that can be estimated unbiasedly;

(c) determine the UMVU estimator of the estimand of (a).

3.26 In Example 3.9 when $p_{ij} = p_{i+}p_{+j}$, determine the variances of the two unbiased
estimators $\delta_0 = n_{ij}/n$ and $\delta_1 = n_{i+}n_{+j}/n^2$ of $p_{ij}$, and show directly that $\operatorname{var}(\delta_0) > \operatorname{var}(\delta_1)$
for all $n > 1$.

3.27 In Example 3.9, show that independence of $A$ and $B$ implies that $(n_{1+}, \ldots, n_{I+})$ and
$(n_{+1}, \ldots, n_{+J})$ are independent with multinomial distributions as stated.

3.28 Verify (3.20).

3.29 Let $X$, $Y$, and $g$ be such that $E[g(X, Y)|y]$ is independent of $y$. Then, $E[f(Y)g(X, Y)] =
E[f(Y)]E[g(X, Y)]$, and hence $f(Y)$ and $g(X, Y)$ are uncorrelated, for all $f$.

3.30 In Example 3.10, show that the estimator $\delta_1$ of $p_{ijk}$ is unbiased for the model (3.20).
[Hint: Problem 3.29.]


Section 4 

4.1 Let $X_1, \ldots, X_n$ be iid with distribution $F$.

(a) Characterize the totality of functions $f(X_1, \ldots, X_n)$ which are unbiased estimators
of zero for the class $\mathcal{F}_0$ of all distributions $F$ having a density.

(b) Give one example of a nontrivial unbiased estimator of zero when (i) $n = 2$ and (ii)
$n = 3$.

4.2 Let $\mathcal{F}$ be the class of all univariate distribution functions $F$ that have a probability
density function $f$ and finite $m$th moment.

(a) Let $X_1, \ldots, X_n$ be independently distributed with common distribution $F \in \mathcal{F}$.
For $n \ge m$, find the UMVU estimator of $\xi^m$ where $\xi = \xi(F) = EX_i$.

(b) Show that for the case that $P(X_i = 1) = p$, $P(X_i = 0) = q$, $p + q = 1$, the estimator
of (a) reduces to (3.2).

4.3 In the preceding problem, show that $1/\operatorname{var}_F X_i$ does not have an unbiased estimator
for any $n$.

4.4 Let $X_1, \ldots, X_n$ be iid with distribution $F \in \mathcal{F}$ where $\mathcal{F}$ is the class of all symmetric
distributions with a probability density. There exists no UMVU estimator of the center
of symmetry $\theta$ of $F$ (if unbiasedness is required only for the distributions $F$ for which
the expectation of the estimator exists). [Hint: The UMVU estimator of $\theta$ when $F$ is
$U(\theta - 1/2, \theta + 1/2)$, which was obtained in Problem 2.24, is unbiased for all $F \in \mathcal{F}$;
so is $\bar{X}$.]

4.5 If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independently distributed according to $F$ and
$G \in \mathcal{F}_0$, defined in Problem 4.1, the order statistics $X_{(1)} < \cdots < X_{(m)}$ and
$Y_{(1)} < \cdots < Y_{(n)}$ are sufficient and complete. [Hint: For completeness, generalize the
second proof suggested in Problem 6.33.]

4.6 Under the assumptions of the preceding problem, find the UMVU estimator of
$P(X_i < Y_j)$.

4.7 Under the assumptions of Problem 4.5, let $\xi = EX_i$ and $\eta = EY_j$. Show that $\xi^2\eta^2$
possesses an unbiased estimator if and only if $m \ge 2$ and $n \ge 2$.

4.8 Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be iid $F \in \mathcal{F}$, where $\mathcal{F}$ is the family of all distributions
with probability density and finite second moments. Show that
$\delta(X, Y) = \sum (X_i - \bar{X})(Y_i - \bar{Y})/(n - 1)$ is UMVU for $\operatorname{cov}(X, Y)$.

4.9 Under the assumptions of the preceding problem, find the UMVU estimator of 

(a) P(X; < Yj); 

(b) $P(X_i < X_j \text{ and } Y_i < Y_j)$, $i \ne j$.

4.10 Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be iid with $F \in \mathcal{F}$, where $\mathcal{F}$ is the family of all bivariate
densities. Show that the sufficient statistic $T$, which generalizes the order statistics to
the bivariate case, is complete. [Hint: Generalize the second proof suggested in Problem
6.33. As an exponential family for $(X, Y)$, take the densities proportional to $e^{Q(x,y)}$ where

$Q(x, y) = (\theta_{01}x + \theta_{10}y) + (\theta_{02}x^2 + \theta_{11}xy + \theta_{20}y^2) + \cdots + (\theta_{0n}x^n + \cdots + \theta_{n0}y^n) - x^{2n} - y^{2n}$.]


Section 5

5.1 Under the assumptions of Problem 1.3, determine for each $p_1$, the value $L_V(p_1)$ of
the LMVU estimator of $p$ at $p_1$ and compare the function $L_V(p)$, $0 < p < 1$, with the
variance $V_{p_0}(p)$ of the estimator which is LMVU at (a) $p_0 = 1/3$ and (b) $p_0 = 1/2$.

5.2 Determine the conditions under which equality holds in (5.1).

5.3 Verify $I(\theta)$ for the distributions of Table 5.1.

5.4 If $X$ is normal with mean zero and standard deviation $\sigma$, determine $I(\sigma)$.

5.5 Find $I(p)$ for the negative binomial distribution.

5.6 If $X$ is distributed as $P(\lambda)$, show that the information it contains about $\sqrt{\lambda}$ is
independent of $\lambda$.

5.7 Verify the following statements, asserted by Basu (1988, Chapter 1), which illustrate
the relationship between information, sufficiency, and ancillarity. Suppose that
we let $I(\theta) = E_\theta[-\partial^2/\partial\theta^2 \log f(X|\theta)]$ be the information in $X$ about $\theta$ and let
$J(\theta) = E_\theta[-\partial^2/\partial\theta^2 \log g(T|\theta)]$ be the information about $\theta$ contained in a statistic
$T$, where $g(\cdot|\theta)$ is the density function of $T$. Define $\lambda(\theta) = I(\theta) - J(\theta)$, a measure of
information lost by using $T$ instead of $X$. Under suitable regularity conditions, show
that

(a) $\lambda(\theta) \ge 0$ for all $\theta$.

(b) $\lambda(\theta) = 0$ if and only if $T$ is sufficient for $\theta$.

(c) If $Y$ is ancillary but $(T, Y)$ is sufficient, then $I(\theta) = E_\theta[J(\theta|Y)]$, where

$J(\theta|y) = E_\theta\left[-\frac{\partial^2}{\partial\theta^2} \log h(T|y, \theta) \,\Big|\, Y = y\right]$

and $h(t|y, \theta)$ is the conditional density of $T$ given $Y = y$.

(Basu's "regularity conditions" are mainly concerned with interchange of integration
and differentiation. Assume any such interchanges are valid.)

5.8 Find a function of $\theta$ for which the amount of information is independent of $\theta$:

(a) for the gamma distribution $\Gamma(\alpha, \beta)$ with $\alpha$ known and with $\theta = \beta$;

(b) for the binomial distribution $b(p, n)$ with $\theta = p$.

5.9 For inverse binomial sampling (see Example 3.2): 

(a) Show that the best unbiased estimator of $p$ is given by $\delta^*(Y) = (m - 1)/(Y + m - 1)$.

(b) Show that the information contained in $Y$ about $p$ is $I(p) = m/[p^2(1 - p)]$.

(c) Show that $\operatorname{var}\,\delta^* > 1/I(p)$.

(The estimator 5* can be interpreted as the success rate if we ignore the last trial, which 
we know must be a success.) 

5.10 Show that (5.13) can be differentiated by differentiating under the integral sign
when $p_\theta(x)$ is given by (5.24), for each of the distributions of Table 5.2. [Hint: Form the
difference quotient and apply the dominated convergence theorem.]

5.11 Verify the entries of Table 5.2. 

5.12 Evaluate (5.25) when $f$ is the density of Student's $t$-distribution with $\nu$ degrees of
freedom. [Hint: Use the fact that

$\int_{-\infty}^{\infty} \frac{dx}{(1 + x^2)^k} = \frac{\Gamma(1/2)\,\Gamma(k - 1/2)}{\Gamma(k)}$.]

5.13 For the distribution with density (5.24), show that $I(\theta)$ is independent of $\theta$.

5.14 Verify (a) formula (5.25) and (b) formula (5.27). 

5.15 For the location $t$ density, calculate the information inequality bound for unbiased
estimators of $\theta$.

5.16 (a) For the scale family with density $(1/\theta)f(x/\theta)$, $\theta > 0$, the amount of information
a single observation $X$ has about $\theta$ is

$\frac{1}{\theta^2}\int\left[\frac{y f'(y)}{f(y)} + 1\right]^2 f(y)\,dy$.

(b) Show that the information $X$ contains about $\xi = \log\theta$ is independent of $\theta$.

(c) For the Cauchy distribution $C(0, \theta)$, $I(\theta) = 1/(2\theta^2)$.

5.17 If $p_\theta(x)$ is given by (1.5.1) with $s = 1$ and $T(x) = \delta(x)$, show that $\operatorname{var}[\delta(X)]$ attains
the lower bound (5.31) and is the only estimator to do so. [Hint: Use (5.18) and (1.5.15).]

5.18 Show that if a given function $g(\theta)$ has an unbiased estimator, there exists an unbiased
estimator $\delta$ which for all $\theta$ values attains the lower bound (5.1) for some $\psi(x, \theta)$
satisfying (5.2) if and only if $g(\theta)$ has a UMVU estimator $\delta_0$. [Hint: By Theorem 5.1,
$\psi(x, \theta) = \delta_0(x)$ satisfies (5.2). For any other unbiased $\delta$, $\operatorname{cov}(\delta - \delta_0, \delta_0) = 0$ and hence
$\operatorname{var}(\delta_0) = [\operatorname{cov}(\delta, \delta_0)]^2/\operatorname{var}(\delta_0)$, so that $\psi = \delta_0$ provides an attainable bound.] (Blyth
1974).

5.19 Show that if $E_\theta\delta = g(\theta)$, and $\operatorname{var}(\delta)$ attains the information inequality bound (5.31),
then

$\delta(x) = g(\theta) + \frac{g'(\theta)}{I(\theta)}\,\frac{\partial}{\partial\theta}\log p_\theta(x)$.

5.20 If $E_\theta\delta = g(\theta)$, the information inequality lower bound is $IB(\theta) = [g'(\theta)]^2/I(\theta)$. If
$\theta = h(\xi)$ where $h$ is differentiable, show that $IB(\xi) = IB(\theta)$.

5.21 (Liu and Brown 1993) Let $X$ be an observation from the normal mixture density

$p_\theta(x) = \frac{1}{2\sqrt{2\pi}}\left\{e^{-(1/2)(x - \theta)^2} + e^{-(1/2)(x + \theta)^2}\right\}, \quad \theta \in \Omega$,

where $\Omega$ is any neighborhood of zero. Thus, the random variable $X$ is either $N(\theta, 1)$ or
$N(-\theta, 1)$, each with probability 1/2. Show that $\theta = 0$ is a singular point, that is, if there
exists an unbiased estimator of $\theta$ it will have infinite variance at $\theta = 0$.

5.22 Let $X_1, \ldots, X_n$ be a sample from the Poisson($\lambda$) distribution truncated on the
left at 0, i.e., with sample space $\mathcal{X} = \{1, 2, 3, \ldots\}$ (see Problem 3.20). Show that the
Cramér-Rao lower bound for the variance of unbiased estimators of $\lambda$ is

$\frac{\lambda(1 - e^{-\lambda})^2}{n(1 - e^{-\lambda} - \lambda e^{-\lambda})}$

and is not attained by the UMVU estimator. (It is, however, the asymptotic variance of
the ML estimator.)

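5.23 Let $X_1, \ldots, X_n$ be iid according to a density $p(x, \theta)$ which is positive for all $x$.
Then, the variance of any unbiased estimator $\delta$ of $\theta$ satisfies

$\operatorname{var}_{\theta_0}(\delta) \ge \frac{(\theta - \theta_0)^2}{\left[\int_{-\infty}^{\infty}\frac{[p(x, \theta)]^2}{p(x, \theta_0)}\,dx\right]^n - 1}, \quad \theta \ne \theta_0$.

[Hint: Direct consequence of (5.6).]

5.24 If $X_1, \ldots, X_n$ are iid as $N(\theta, \sigma^2)$ where $\sigma$ is known and $\theta$ is known to have one
of the values $0, \pm 1, \pm 2, \ldots$, the inequality of the preceding problem shows that any
unbiased estimator $\delta$ of the restricted parameter $\theta$ satisfies

$\operatorname{var}_{\theta_0}(\delta) \ge \frac{\Delta^2}{e^{n\Delta^2/\sigma^2} - 1}, \quad \Delta \ne 0$,

where $\Delta = \theta - \theta_0$, and hence (the bound being largest at $\Delta = \pm 1$)
$\operatorname{var}_{\theta_0}(\delta) \ge 1/[e^{n/\sigma^2} - 1]$.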

5.25 Under the assumptions of the preceding problem, let $X^*$ be the integer closest to $\bar{X}$.

(a) The estimator $X^*$ is unbiased for the restricted parameter $\theta$.

(b) There exist positive constants $a$ and $b$ such that for all sufficiently large $n$,
$\operatorname{var}_\theta(X^*) \le ae^{-bn}$ for all integers $\theta$.

[Hint: (b) One finds $P(X^* = k) = \int_{I_k}\varphi(t)\,dt$, where $I_k$ is the interval
$((k - \theta - 1/2)\sqrt{n}/\sigma,\ (k - \theta + 1/2)\sqrt{n}/\sigma)$, and hence



The result follows from the fact that for all $y > 0$, $1 - \Phi(y) \le \varphi(y)/y$. See, for
example, Feller 1968, Chapter VII, Section 1. Note that $h(y) = \varphi(y)/(1 - \Phi(y))$ is
the hazard function for the standard normal distribution, so we have $h(y) \ge y$ for
all $y > 0$. $(1 - \Phi(y))/\varphi(y)$ is also known as Mills' ratio (see Stuart and Ord 1987,
Section 5.38). Efron and Johnstone (1990) relate the hazard function to the information
inequality.]


Note. The surprising results of Problems 5.23-5.25 showing a lower bound and variance
which decrease exponentially are due to Hammersley (1950), who shows that, in fact,

$\operatorname{var}(X^*) \sim \sqrt{\frac{8\sigma^2}{\pi n}}\; e^{-n/8\sigma^2}$ as $n \to \infty$.

Further results concerning the estimation of restricted parameters and properties of 
X* are given in Khan (1973), Ghosh (1974), Ghosh and Meeden (1978), and Kojima, 
Morimoto, and Takeuchi (1982). 
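
The exponential decay described in this Note can be checked numerically. The following is a minimal sketch, not part of the original problems: it assumes the bound of Problem 5.24 with $\Delta = \pm 1$, computes the variance of $X^*$ directly from the normal distribution of $\bar{X}$, and compares it with Hammersley's asymptotic expression. The function names are illustrative.

```python
import math

def hcr_bound(n, sigma):
    """Lower bound 1/(exp(n/sigma^2) - 1) of Problem 5.24, taking Delta = +-1."""
    return 1.0 / math.expm1(n / sigma ** 2)

def var_xstar(n, sigma, kmax=50):
    """Variance of X* = integer closest to the sample mean, for integer theta.

    X* - theta equals k when sqrt(n)(Xbar - theta)/sigma lies in
    ((k - 1/2)s, (k + 1/2)s) with s = sqrt(n)/sigma. erfc is used so that
    the tiny tail probabilities are not lost to rounding.
    """
    s = math.sqrt(n) / sigma
    total = 0.0
    for k in range(1, kmax + 1):
        lo = (k - 0.5) * s / math.sqrt(2.0)
        hi = (k + 0.5) * s / math.sqrt(2.0)
        p_k = 0.5 * (math.erfc(lo) - math.erfc(hi))  # P(X* - theta = k)
        total += 2.0 * k * k * p_k                   # errors +k and -k are symmetric
    return total

def hammersley_asymptotic(n, sigma):
    """sqrt(8 sigma^2 / (pi n)) * exp(-n / (8 sigma^2))."""
    return math.sqrt(8.0 * sigma ** 2 / (math.pi * n)) * math.exp(-n / (8.0 * sigma ** 2))

for n in (10, 40, 160, 400):
    v = var_xstar(n, 1.0)
    print(n, hcr_bound(n, 1.0), v, v / hammersley_asymptotic(n, 1.0))
```

With $\sigma = 1$ the variance stays above the bound for each $n$ tried, and the ratio in the last column approaches 1 as $n$ grows. Values of $n/\sigma^2$ much above 700 would overflow `math.expm1`, so the sketch is limited to moderate $n$.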

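5.26 Kiefer inequality.

(a) Let $X$ have density (with respect to $\mu$) $p(x, \theta)$ which is $> 0$ for all $x$, and let
$\Lambda_1$ and $\Lambda_2$ be two distributions on the real line with finite first moments. Then, any
unbiased estimator $\delta$ of $\theta$ satisfies

$\operatorname{var}(\delta) \ge \frac{\left[\int\Delta\,d\Lambda_1(\Delta) - \int\Delta\,d\Lambda_2(\Delta)\right]^2}{\int\psi^2(x, \theta)\,p(x, \theta)\,d\mu(x)}$

where

$\psi(x, \theta) = \frac{\int_{\Omega_\theta} p(x, \theta + \Delta)\,[d\Lambda_1(\Delta) - d\Lambda_2(\Delta)]}{p(x, \theta)}$

with $\Omega_\theta = \{\Delta : \theta + \Delta \in \Omega\}$.

(b) If $\Lambda_1$ and $\Lambda_2$ assign probability 1 to $\Delta = 0$ and $\Delta$, respectively, the inequality
reduces to (5.6) with $g(\theta) = \theta$. [Hint: Apply (5.1).] (Kiefer 1952.)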


5.27 Verify directly that the following families of densities satisfy (5.38).

(a) The exponential family of (1.5.1),

$p_\eta(x) = h(x)\,e^{\eta T(x) - A(\eta)}$.

(b) The location t family of Example 5.16. 

(c) The logistic density of Table 1.4.1. 

5.28 Extend condition (5.38) to vector-valued parameters, and show that it is satisfied by 
the exponential family (1.5.1) for s > 1. 

5.29 Show that the assumption (5.36(b)) implies (5.38), so Theorem 5.15 is, in fact, a 
corollary of Theorem 5.10. 

5.30 Show that (5.38) is satisfied if either of the following is true: 

(a) $|\partial\log p_\theta/\partial\theta|$ is bounded.

(b) $[p_{\theta+\Delta}(x) - p_\theta(x)]/\Delta \to \partial\log p_\theta/\partial\theta$ uniformly.

5.31 (a) Show that if (5.38) holds, then the family of densities is strongly differentiable 
(see Note 8.6). 

(b) Show that weak differentiability is implied by strong differentiability. 

5.32 Brown and Gajek (1990) give two different sufficient conditions for (8.2) to hold, 
which are given below. Show that each implies (8.2). (Note that, in the progression from 
(a) to (b) the conditions become weaker, thus more widely applicable and harder to 
check.) 


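(a) For some $B < \infty$,

$E_{\theta_0}\left[\frac{\partial^2}{\partial\theta^2}\,p_\theta(X)\Big/p_{\theta_0}(X)\right]^2 < B$

for all $\theta$ in a neighborhood of $\theta_0$.

(b) If $p^*_t(x) = \partial/\partial\theta\; p_\theta(x)|_{\theta = t}$, then

$\lim_{\Delta\to 0} E_{\theta_0}\left[\frac{p^*_{\theta_0+\Delta}(X) - p^*_{\theta_0}(X)}{p_{\theta_0}(X)}\right]^2 = 0$.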


5.33 Let $\mathcal{F}$ be the class of all unimodal symmetric densities or, more generally, densities
symmetric around zero and satisfying $f(x) \le f(0)$ for all $x$. Show that

$\min_{f \in \mathcal{F}}\int x^2 f(x)\,dx = \frac{1}{12}$,

and that the minimum is attained by the uniform$(-\frac{1}{2}, \frac{1}{2})$ distribution. Thus, the uniform
distribution has minimum variance among symmetric unimodal distributions. (See Example
4.8.6 for large-sample properties of the scale uniform.) [Hint: The side condition
$\int f(x)\,dx = 1$, together with the method of undetermined multipliers, yields an equivalent
problem, minimization of $\int (x^2 - a^2)f(x)\,dx$, where $a$ is chosen to satisfy the
constraint. A Neyman-Pearson type argument will now work.]


Section 6 

6.1 For any random variables $(\psi_1, \ldots, \psi_s)$, show that the matrices $\|E\psi_i\psi_j\|$ and
$C = \|\operatorname{cov}(\psi_i, \psi_j)\|$ are positive semidefinite.

6.2 In this problem, we establish some facts about eigenvalues and eigenvectors of square
matrices. (For a more general treatment, see, for example, Marshall and Olkin 1979,
Chapter 20.)

We use the facts that a scalar $\lambda > 0$ is an eigenvalue of the $n \times n$ symmetric matrix
$A$ if there exists an $n \times 1$ vector $p$, the corresponding eigenvector, satisfying $Ap = \lambda p$.
If $A$ is nonsingular, there are $n$ eigenvalues with corresponding linearly independent
eigenvectors.

(a) Show that $A = P'D_\lambda P$, where $D_\lambda$ is a diagonal matrix of eigenvalues of $A$ and $P$
is an $n \times n$ matrix whose rows are the corresponding eigenvectors and that satisfies
$P'P = PP' = I$, the identity matrix.

(b) Show that $\max_x \frac{x'Ax}{x'x}$ = largest eigenvalue of $A$.

(c) If $B$ is a nonsingular symmetric matrix with eigenvector-eigenvalue representation
$B = Q'D_\beta Q$, then $\max_x \frac{x'Ax}{x'Bx}$ = largest eigenvalue of $A^*$, where
$A^* = D_\beta^{-1/2} Q A Q' D_\beta^{-1/2}$ and $D_\beta^{-1/2}$ is a diagonal matrix whose elements are the
reciprocals of the square roots of the eigenvalues of $B$.

(d) For any square matrices $C$ and $D$, show that the eigenvalues of the matrix $CD$ are
the same as the eigenvalues of the matrix $DC$, and hence that $\max_x \frac{x'Ax}{x'Bx}$ = largest
eigenvalue of $AB^{-1}$.

(e) If $A = aa'$, where $a$ is an $n \times 1$ vector ($A$ is thus a rank-one matrix), then
$\max_x \frac{x'aa'x}{x'Bx} = a'B^{-1}a$.

[Hint: For part (b) show that $\frac{x'Ax}{x'x} = \frac{y'D_\lambda y}{y'y} = \frac{\sum_i \lambda_i y_i^2}{\sum_i y_i^2}$, where $y = Px$, and hence the
maximum is achieved at the vector $y$ that is 1 at the coordinate of the largest eigenvalue
and zero everywhere else.]
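
As a sanity check on the statement in part (e), not a proof, the ratio $x'aa'x/x'Bx$ can be maximized by brute force in two dimensions and compared with $a'B^{-1}a$. The matrix `B` and vector `a` below are arbitrary illustrative choices (with `B` positive definite), not values from the text.

```python
import math

# Illustrative inputs (assumptions): a symmetric positive definite B and a vector a.
B = ((2.0, 1.0), (1.0, 3.0))
a = (1.0, 2.0)

def ratio(x):
    """x'aa'x / x'Bx for a 2-vector x."""
    num = (a[0] * x[0] + a[1] * x[1]) ** 2
    den = B[0][0] * x[0] ** 2 + 2.0 * B[0][1] * x[0] * x[1] + B[1][1] * x[1] ** 2
    return num / den

# The ratio is scale invariant, so it suffices to scan directions on a half circle.
steps = 200000
best = max(
    ratio((math.cos(math.pi * k / steps), math.sin(math.pi * k / steps)))
    for k in range(steps)
)

# Closed form a'B^{-1}a using the explicit inverse of a 2 x 2 matrix.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
Binv_a = (
    (B[1][1] * a[0] - B[0][1] * a[1]) / det,
    (-B[1][0] * a[0] + B[0][0] * a[1]) / det,
)
closed_form = a[0] * Binv_a[0] + a[1] * Binv_a[1]

print(best, closed_form)
```

The grid maximum agrees with the closed form up to the grid resolution, and the maximizing direction is proportional to $B^{-1}a$.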

6.3 An alternate proof of Theorem 6.1 uses the method of Lagrange (or undetermined)
multipliers. Show that, for fixed $\gamma$, the maximum value of $a'\gamma$, subject to the constraint
that $a'Ca = 1$, is obtained by the solutions to

$\frac{\partial}{\partial a_i}\left\{a'\gamma - \tfrac{1}{2}\lambda[a'Ca - 1]\right\} = 0$,

where $\lambda$ is the undetermined multiplier. (The solution is $a = \pm C^{-1}\gamma/\sqrt{\gamma'C^{-1}\gamma}$.)

6.4 Prove (6.11) under the assumptions of the text. 

6.5 Verify (a) the information matrices of Table 6.1 and (b) Equations (6.15) and (6.16). 

6.6 If $p(x) = (1 - \varepsilon)\varphi(x - \xi) + (\varepsilon/\tau)\varphi[(x - \xi)/\tau]$ where $\varphi$ is the standard normal density,
find $I(\varepsilon, \xi, \tau)$.

6.7 Verify the expressions (6.20) and (6.21). 

6.8 Let $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ be a partitioned matrix with $A_{22}$ square and nonsingular, and
let

$B = \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}$.

Show that $|A| = |A_{11} - A_{12}A_{22}^{-1}A_{21}| \cdot |A_{22}|$.

6.9 (a) Let

$A = \begin{pmatrix} a & b' \\ b & C \end{pmatrix}$

where $a$ is a scalar and $b$ a column matrix, and suppose that $A$ is positive definite.
Show that $|A| \le a|C|$ with equality holding if and only if $b = 0$.

(b) More generally, if the matrix $A$ of Problem 6.8 is positive definite, show that
$|A| \le |A_{11}| \cdot |A_{22}|$ with equality holding if and only if $A_{12} = 0$.

[Hint: Transform $A_{11}$ and the positive semidefinite $A_{12}A_{22}^{-1}A_{21}$ simultaneously to
diagonal form.]

6.10 (a) Show that if the matrix $A$ is nonsingular, then for any vector $x$,
$(x'Ax)(x'A^{-1}x) \ge (x'x)^2$.

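(b) Show that, in the notation of Theorem 6.6 and the following discussion,

$\frac{[(\partial/\partial\theta_i)E_\theta\delta]^2}{I_{ii}(\theta)} = \frac{(\varepsilon_i'\alpha)^2}{\varepsilon_i' I(\theta)\varepsilon_i}$,

and if $\alpha = (0, \ldots, 0, \alpha_i, 0, \ldots, 0)$, $\alpha' I(\theta)^{-1}\alpha = (\varepsilon_i'\alpha)^2\,\varepsilon_i' I(\theta)^{-1}\varepsilon_i$, and hence
establish (6.25).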

6.11 Prove that (6.26) is necessary for equality in (6.25). [Hint: Problem 6.9(a).] 

6.12 Prove the Bhattacharyya inequality (6.29) and show that the condition of equality 
is as stated. 

8 Notes 

8.1 Unbiasedness and Information 

The concept of unbiasedness as “lack of systematic error” in the estimator was introduced 
by Gauss (1821) in his work on the theory of least squares. It has continued as a basic 
assumption in the developments of this theory since then. 

The amount of information that a data set contains about a parameter was introduced by 
Edgeworth (1908, 1909) and was developed more systematically by Fisher (1922 and 
later papers). The first version of the information inequality, and hence connections with
unbiased estimation, appears to have been given by Fréchet (1943). Early extensions
and rediscoveries are due to Darmois (1945), Rao (1945), and Cramér (1946b). The
designation "information inequality," which replaced the earlier "Cramér-Rao inequality,"
was proposed by Savage (1954).

8.2 UMVU Estimators 

The first UMVU estimators were obtained by Aitken and Silverstone (1942) in the 
situation in which the information inequality yields the same result (Problem 5.17). 
UMVU estimators as unique unbiased functions of a suitable sufficient statistic were 
derived in special cases by Halmos (1946) and Kolmogorov (1950) and were pointed out 
as a general fact by Rao (1947). An early use of Method 1 for determining such unbiased 
estimators is due to Tweedie (1947). The concept of completeness was defined, its 
implications for unbiased estimation developed, and Theorem 1.7 obtained, in Lehmann 
and Scheffe (1950, 1955, 1956). 

Theorem 1.11 has been used to determine UMVU estimators in many special cases.
Some applications include those of Abbey and David (1970, exponential distribution),
Ahuja (1972, truncated Poisson), Bhattacharyya et al. (1977, censored), Bickel and
Lehmann (1969, convex), Varde and Sathe (1969, truncated exponential), Brown and
Cohen (1974, common mean), Downton (1973, $P(X < Y)$), Woodward and Kelley
(1977, $P(X < Y)$), Iwase (1983, inverse Gaussian), and Kremers (1986, sum-quota
sampling).

Figure 8.1. Illustration of the information inequality


8.3 Existence of Unbiased Estimators

Doss and Sethuraman (1989) show that the process of bias reduction may not always
be the wisest course. If an estimand $g(\theta)$ does not have an unbiased estimator, and one
tries to reduce the bias in a biased estimator $\delta$, they show that as the bias goes to zero,
$\operatorname{var}(\delta) \to \infty$ (see Problem 1.4).

This result has implications for bias-reduction procedures such as the jackknife and 
the bootstrap. (For an introduction to the jackknife and the bootstrap, see Efron and 
Tibshirani 1993 or Shao and Tu 1995.) In particular, Efron and Tibshirani (1993, Section 
10.6) discuss some practical implications of bias reduction, where they urge caution in 
its use, as large increases in standard errors can result. 

Liu and Brown (1993) call a problem singular if there exists no unbiased estimator 
with finite variance. More precisely, if $\mathcal{F}$ is a family of densities, then if a problem is
singular, there will be at least one member of $\mathcal{F}$, called a singular point, where any
unbiased estimator of a parameter (or functional) will have infinite variance. There are
many examples of singular problems, both in parametric and nonparametric estimation,
with nonparametric density estimation being, perhaps, the best known. Two particularly
simple examples of singular problems are provided by Example 1.2 (estimation of $1/p$
in a binomial problem) and Problem 5.21 (a mixture estimation problem).


8.4 Geometry of the Information Inequality

The information inequality can be interpreted as, and a proof can be based on, the fact
that the length of the hypotenuse of a right triangle exceeds the length of each side.

For two vectors $t$ and $q$, define $\langle t, q\rangle = t'q$, with $\langle t, t\rangle = |t|^2$. For the triangle
in Figure 8.1, using the fact that the cosine of the angle between $t$ and $q$ is
$\cos(t, q) = t'q/|t||q|$ and the fact that the hypotenuse is the longest side, we have

$|t| \ge |t|\cos(t, q) = |t|\left[\frac{\langle t, q\rangle}{|t||q|}\right] = \frac{\langle t, q\rangle}{|q|}$.

If we define $\langle X, Y\rangle = E[(X - EX)(Y - EY)]$ for random variables $X$ and $Y$,
applying the above inequality with this definition results in the covariance inequality
(5.1), which, in turn, leads to the information inequality. See Fabian and Hannan (1977)
for a rigorous development.
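
The covariance inequality obtained from this inner product can be illustrated on data. The sketch below is an illustration only: it uses the empirical distribution of a small made-up sample in place of the true expectations, and the names `delta` and `psi` are placeholders for an estimator and the auxiliary function of (5.1).

```python
def centered_inner(x, y):
    """Empirical version of <X, Y> = E[(X - EX)(Y - EY)]."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

# Made-up values standing in for delta(X) and psi(X) (assumptions, not from the text).
delta = [2.1, 0.4, 3.3, 1.8, 2.9, 0.7]
psi = [1.0, -0.5, 2.0, 0.3, 1.1, -0.2]

var_delta = centered_inner(delta, delta)
bound = centered_inner(delta, psi) ** 2 / centered_inner(psi, psi)
print(var_delta, bound)   # |t|^2 is at least <t, q>^2 / |q|^2

# Equality holds when psi is a linear function of delta (the two vectors are collinear).
psi_linear = [3.0 * d - 1.0 for d in delta]
bound_linear = centered_inner(delta, psi_linear) ** 2 / centered_inner(psi_linear, psi_linear)
print(var_delta, bound_linear)
```

The first comparison shows the strict inequality for a generic pair; the second shows that the bound is attained when the angle between the two centered vectors is zero.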

8.5 Fisher Information and the Hazard Function

Efron and Johnstone (1990) investigate an identity between the Fisher information number
and the hazard function, $h$, defined by

$h_\theta(x) = \lim_{\Delta\to 0}\Delta^{-1}P(x \le X < x + \Delta \mid X \ge x) = \frac{f_\theta(x)}{1 - F_\theta(x)}$

where $f_\theta$ and $F_\theta$ are the density and distribution function of the random variable $X$,
respectively. The hazard function, $h(x)$, represents the conditional survival rate given
survival up to time $x$ and plays an important role in survival analysis. (See, for example,
Kalbfleisch and Prentice 1980, Cox and Oakes 1984, Fleming and Harrington 1991.)
Efron and Johnstone show that

$\int_{-\infty}^{\infty}\left[\frac{\partial}{\partial\theta}\log f_\theta(x)\right]^2 f_\theta(x)\,dx = \int_{-\infty}^{\infty}\left[\frac{\partial}{\partial\theta}\log h_\theta(x)\right]^2 f_\theta(x)\,dx$.

They then interpret this identity and discuss its implications to, and connections with,
survival analysis and statistical curvature of hazard models, among other things. They
also note that this identity can be derived as a consequence of the more general result of
James (1986), who showed that if $b(\cdot)$ is a continuous function of the random variable
$X$, then

$\operatorname{var}[b(X)] = E[b(X) - \bar{b}(X)]^2$, where $\bar{b}(x) = E[b(X) \mid X > x]$,

as long as the expectations exist.
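
James's identity can be checked numerically in a simple case. The sketch below assumes $X$ has the standard exponential distribution, reads the conditional mean as $\bar{b}(x) = E[b(X) \mid X > x]$, and uses a midpoint rule on a truncated grid, so the two sides agree only up to discretization error. For $b(x) = x^2$ both sides equal 20, since $E X^4 - (E X^2)^2 = 24 - 4$.

```python
import math

def james_check(b, h=0.002, upper=40.0):
    """Return (var[b(X)], E[b(X) - bbar(X)]^2) for X ~ Exp(1), by midpoint quadrature.

    bbar at a grid cell is approximated by the mean of b over that cell and
    all cells to its right, which introduces an error of order h.
    """
    m = int(upper / h)
    xs = [(i + 0.5) * h for i in range(m)]
    w = [math.exp(-x) * h for x in xs]          # approximate probability of each cell
    bv = [b(x) for x in xs]

    mean_b = sum(bi * wi for bi, wi in zip(bv, w))
    var_b = sum((bi - mean_b) ** 2 * wi for bi, wi in zip(bv, w))

    tail_w = 0.0     # running P(X > x), accumulated from the right
    tail_bw = 0.0    # running integral of b over the same tail
    rhs = 0.0
    for i in range(m - 1, -1, -1):
        tail_w += w[i]
        tail_bw += bv[i] * w[i]
        b_bar = tail_bw / tail_w
        rhs += (bv[i] - b_bar) ** 2 * w[i]
    return var_b, rhs

var_b, rhs = james_check(lambda x: x * x)
print(var_b, rhs)
```

Halving `h` roughly halves the gap between the two printed values, which is consistent with the identity holding exactly in the continuous limit.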

8.6 Weak and Strong Differentiability 

Research into determining necessary and sufficient conditions for the applicability of the
Information Inequality bound has a long history (see, for example, Blyth and Roberts
1972, Fabian and Hannan 1977, Ibragimov and Has'minskii 1981, Section 1.7,
Müller-Funk et al. 1989, Brown and Gajek 1990). What has resulted is a condition on the density
sufficient to ensure (5.29).

The precise condition needed was presented by Fabian and Hannan (1977), who call it
weak differentiability. The function $p_{\theta+\Delta}(x)/p_\theta(x)$ is weakly differentiable at $\theta$ if there
is a measurable function $q$ such that

(8.1) $\lim_{\Delta\to 0}\int h(x)\left[\Delta^{-1}\left(\frac{p_{\theta+\Delta}(x)}{p_\theta(x)} - 1\right) - q(x)\right]p_\theta(x)\,d\mu(x) = 0$

for all $h(\cdot)$ such that $\int h^2(x)p_\theta(x)\,d\mu(x) < \infty$. Weak differentiability is actually
equivalent (necessary and sufficient) to the existence of a function $q_\theta(x)$ such that
$(\partial/\partial\theta)E_\theta\delta = E_\theta\delta q_\theta$. Hence, it can replace condition (5.38) in Theorem 5.15.





Since weak differentiability is often difficult to verify, Brown and Gajek (1990) introduce
the more easily verifiable condition of strong differentiability, which implies weak
differentiability, and thus can also replace condition (5.38) in Theorem 5.15 (Problem
5.31). The function $p_{\theta+\Delta}(x)/p_\theta(x)$ is strongly differentiable at $\theta = \theta_0$ with derivative
$q_{\theta_0}(x)$ if

(8.2) $\lim_{\Delta\to 0}\int\left[\Delta^{-1}\left(\frac{p_{\theta_0+\Delta}(x)}{p_{\theta_0}(x)} - 1\right) - q_{\theta_0}(x)\right]^2 p_{\theta_0}(x)\,d\mu(x) = 0$.

These variations of the usual definition of differentiability are well suited for the
information inequality problem. In fact, consider the expression in the square brackets
in (8.1). If the limit of this expression exists, it is $q_\theta(x) = \partial\log p_\theta(x)/\partial\theta$. Of course,
existence of this limit does not, by itself, imply condition (8.2); such an implication
requires an integrability condition.

Brown and Gajek (1990) detail a number of easier-to-check conditions that imply
(8.2). (See Problem 5.32.) Fabian and Hannan (1977) remark that if (8.1) holds and
$\partial\log p_\theta(x)/\partial\theta$ exists, then it must be the case that $q_\theta(x) = \partial\log p_\theta(x)/\partial\theta$. However,
the existence of one does not imply the existence of the other.

1 First Examples

In Section 1.1, the principle of unbiasedness was introduced as an impartiality
restriction to eliminate estimators such as $\delta(X) \equiv g(\theta_0)$, which would give very
low risk for some parameter values at the expense of very high risk for others. As
was seen in Sections 2.2-2.4, in many important situations there exists within the
class of unbiased estimators a member that is uniformly better for any convex loss
function than any other unbiased estimator.

In the present chapter, we shall use symmetry considerations as the basis for
another such impartiality restriction with a somewhat different domain of
applicability.

Example 1.1 Estimating binomial p. Consider $n$ binomial trials with unknown
probability $p$ ($0 < p < 1$) of success which we wish to estimate with loss function
$L(p, d)$, for example, $L(p, d) = (d - p)^2$ or $L(p, d) = (d - p)^2/[p(1 - p)]$. If $X_i$,
$i = 1, \ldots, n$, is 1 or 0 as the $i$th trial is a success or failure, the joint distribution of
the $X$'s is

$P(x_1, \ldots, x_n) = p^{\sum x_i}(1 - p)^{\sum(1 - x_i)}$.

Suppose now that another statistician interchanges the definition of success and
failure. For this worker, the probability of success is

(1.1) $p' = 1 - p$

and the indicator of success and failure on the $i$th trial is

(1.2) $X_i' = 1 - X_i$.

The joint distribution of the $X_i'$ is

$P'(x_1', \ldots, x_n') = p'^{\sum x_i'}(1 - p')^{\sum(1 - x_i')}$

and hence satisfies

(1.3) $P'(x_1', \ldots, x_n') = P(x_1, \ldots, x_n)$.

In the new terminology, the estimated value $d'$ of $p'$ is

(1.4) $d' = 1 - d$,

and the loss resulting from its use is $L(p', d')$. The loss functions suggested at the
beginning of the example (and, in fact, most loss functions that we would want to
employ in this situation) satisfy

(1.5) $L(p, d) = L(p', d')$.

Under these circumstances, the problem of estimating p with loss function L is said 
to be invariant under the transformations (1.1), (1.2), and (1.4). This invariance is 
an expression of the complete symmetry of the estimation problem with respect 
to the outcomes of success and failure. 

Suppose now that in the above situation, we had decided to use $\delta(x)$, where
$x = (x_1, \ldots, x_n)$, as an estimator of $p$. Then, the formal identity of the primed and
unprimed problem suggests that we should use

(1.6) $\delta(x') = \delta(1 - x_1, \ldots, 1 - x_n)$

to estimate $p' = 1 - p$. On the other hand, it is natural to estimate $1 - p$ by 1
minus the estimator of $p$, i.e., by

(1.7) $1 - \delta(x)$.

It seems desirable that these two estimators should agree and hence that

(1.8) $\delta(x') = 1 - \delta(x)$.

An estimator satisfying (1.8) will be called equivariant under the transformations
(1.1), (1.2), and (1.4). Note that the standard estimate $\sum X_i/n$ satisfies (1.8).

The arguments for (1.6) and (1.7) as estimators of $1 - p$ are of a very different
nature. The appropriateness of (1.6) depends entirely on the symmetry of the
situation. It would continue to be suitable if it were known, for example, that
$\frac{1}{3} < p < \frac{2}{3}$ but not if, say, $\frac{1}{4} < p < \frac{1}{2}$. In fact, in the latter case, $\delta(X)$ would
typically be chosen to be $< \frac{1}{2}$ for all $X$, and hence $\delta(X')$ would be entirely unsuitable
as an estimator of $1 - p$, which is known to be $> \frac{1}{2}$. More generally, (1.6) would
cease to be appropriate if any prior information about $p$ is available which is not
symmetric about $\frac{1}{2}$. In contrast, the argument leading to (1.7) is quite independent of
any symmetry assumptions, but simply reflects the fact that if $\delta(X)$ is a reasonable
estimator of a parameter $\theta$ (that is, is likely to be close to $\theta$), then $1 - \delta(X)$ is
reasonable as an estimator of $1 - \theta$.
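
Condition (1.8) can be checked exhaustively for small $n$. The sketch below is illustrative only: `sample_proportion` is the standard estimate, while `lopsided` is a hypothetical estimator introduced here to show a failure of (1.8). Exact fractions avoid any rounding question.

```python
from fractions import Fraction
from itertools import product

def flip(x):
    """Transformation (1.2): interchange success and failure on every trial."""
    return tuple(1 - xi for xi in x)

def is_equivariant(delta, n):
    """Check delta(x') == 1 - delta(x), i.e. (1.8), over all 2^n outcomes."""
    return all(delta(flip(x)) == 1 - delta(x) for x in product((0, 1), repeat=n))

def sample_proportion(x):
    """Standard estimate: number of successes divided by n."""
    return Fraction(sum(x), len(x))

def lopsided(x):
    """Hypothetical estimator (not from the text) that treats the outcomes asymmetrically."""
    return Fraction(sum(x) + 1, len(x) + 3)

print(is_equivariant(sample_proportion, 4))
print(is_equivariant(lopsided, 4))
```

The first check succeeds and the second fails: adding one pseudo-success and two pseudo-failures breaks the symmetry between the two outcomes.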

We shall postpone giving a general definition of equivariance to the next section, 
and in the remainder of the present section, we formulate this concept and explore 
its implications for the special case of location problems. 

Let $X = (X_1, \ldots, X_n)$ have joint distribution with probability density

(1.9) $f(x - \xi) = f(x_1 - \xi, \ldots, x_n - \xi)$, $-\infty < \xi < \infty$,

where $f$ is known and $\xi$ is an unknown location parameter. Suppose that for the
problem of estimating $\xi$ with loss function $L(\xi, d)$, we have found a satisfactory
estimator $\delta(X)$.

In analogy with the transformations (1.2) and (1.1) of the observations $X_i$ and
the parameter $p$ in Example 1.1, consider the transformations

(1.10) $X_i' = X_i + a$

and

(1.11) $\xi' = \xi + a$.

The joint density of $X' = (X_1', \ldots, X_n')$ can be written as

$f(x' - \xi') = f(x_1' - \xi', \ldots, x_n' - \xi')$

so that in analogy with (1.3) we have by (1.10) and (1.11)

(1.12) $f(x' - \xi') = f(x - \xi)$ for all $x$ and $\xi$.

The estimated value $d'$ of $\xi'$ is

(1.13) $d' = d + a$

and the loss resulting from its use is $L(\xi', d')$.

In analogy with (1.5), we require $L$ to satisfy $L(\xi', d') = L(\xi, d)$ and hence

(1.14) $L(\xi + a, d + a) = L(\xi, d)$.

A loss function $L$ satisfies (1.14) for all values of $a$ if and only if it depends only
on the difference $d - \xi$, that is, it is of the form

(1.15) $L(\xi, d) = \rho(d - \xi)$.

That (1.15) implies (1.14) is obvious. The converse follows by putting $a = -\xi$ in
(1.14) and letting $\rho(d - \xi) = L(0, d - \xi)$.

We can formalize these considerations in the following definition. 

Definition 1.2 A family of densities $f(x|\xi)$, with parameter $\xi$, and a loss function
$L(\xi, d)$ are location invariant if, respectively, $f(x'|\xi') = f(x|\xi)$ and $L(\xi, d) =
L(\xi', d')$ whenever $\xi' = \xi + a$ and $d' = d + a$. If both the densities and the loss
function are location invariant, the problem of estimating $\xi$ is said to be location
invariant under the transformations (1.10), (1.11), and (1.13).

As in Example 1.1, this invariance is an expression of symmetry. Quite generally, 
symmetry in a situation can be characterized by its lack of change under certain 
transformations. After a transformation, the situation looks exactly as it did before. 
In the present case, the transformations in question are the shifts (1.10), (1.11), and 
(1.13), which leave both the density (1.12) and the loss function (1.14) unchanged. 

Suppose now that in the original (unprimed) problem, we had decided to use
$\delta(X)$ as an estimator of $\xi$. Then, the formal identity of the primed and unprimed
problem suggests that we should use

(1.16) $\delta(X') = \delta(X_1 + a, \ldots, X_n + a)$

to estimate $\xi' = \xi + a$. On the other hand, it is natural to estimate $\xi + a$ by adding
$a$ to the estimator of $\xi$, i.e., by

(1.17) $\delta(X) + a$.

As before, it seems desirable that these two estimators should agree and hence that

(1.18) $\delta(X_1 + a, \ldots, X_n + a) = \delta(X_1, \ldots, X_n) + a$ for all $a$.

Definition 1.3 An estimator satisfying (1.18) will be called equivariant under the
transformations (1.10), (1.11), and (1.13), or location equivariant.¹

All the usual estimators of a location parameter are location equivariant. This
is the case, for example, for the mean, the median, or any weighted average of the
order statistics (with weights adding up to one). The MLE $\hat{\xi}$ is also equivariant
since, if $\hat{\xi}$ maximizes $f(x - \xi)$, $\hat{\xi} + a$ maximizes $f(x - \xi - a)$.

As was the case in Example 1.1, the arguments for (1.16) and (1.17) as estimators
of $\xi + a$ are of a very different nature. The appropriateness of (1.16) results from
the invariance of the situation under shift. It would not be suitable for an estimator
of $\xi + a$, for example, if it were known that $0 < \xi < 1$. Then, $\delta(X)$ would typically
only take values between 0 and 1, and hence $\delta(X')$ would be disastrous as an
estimate of $\xi + a$ if $a > 1$. In contrast, the argument leading to (1.17) is quite
independent of any equivariance arguments, but simply reflects the fact that if
$\delta(X)$ is a reasonable estimator of a parameter $\xi$, then $\delta(X) + a$ is reasonable for
estimating $\xi + a$.

The following theorem states an important set of properties of location
equivariant estimators.

Theorem 1.4 Let $X$ be distributed with density (1.9), and let $\delta$ be equivariant for
estimating $\xi$ with loss function (1.15). Then, the bias, risk, and variance of $\delta$ are
all constant (i.e., do not depend on $\xi$).

Proof. Note that if $X$ has density $f(x)$ (i.e., $\xi = 0$), then $X + \xi$ has density (1.9).
Thus, the bias can be written as

$b(\xi) = E_\xi[\delta(X)] - \xi = E_0[\delta(X + \xi)] - \xi = E_0[\delta(X)]$,

which does not depend on $\xi$.

The proofs for risk and variance are analogous (Problem 1.1). □ 
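
The argument of the proof can be mirrored in a small simulation. In the sketch below (an illustration under assumed choices: standard normal errors, $n = 7$, squared error loss), the same noise draws are shifted by different values of $\xi$. For an equivariant estimator the error $\delta(X) - \xi$ is then the same for every $\xi$, so the estimated risks coincide up to floating-point rounding. `clipped_median` is a hypothetical non-equivariant estimator added for contrast.

```python
import random
import statistics

def mc_risk(delta, xi, noise, loss):
    """Average loss of delta when each data set is a noise vector shifted by xi."""
    total = 0.0
    for z in noise:
        x = [zi + xi for zi in z]
        total += loss(delta(x) - xi)
    return total / len(noise)

def squared_error(e):
    return e * e

def clipped_median(x):
    """Hypothetical estimator (not from the text): the median forced into [0, 1]."""
    return min(1.0, max(0.0, statistics.median(x)))

rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(7)] for _ in range(2000)]

shifts = (-5.0, 0.0, 12.5)
median_risks = [mc_risk(statistics.median, xi, noise, squared_error) for xi in shifts]
clipped_risks = [mc_risk(clipped_median, xi, noise, squared_error) for xi in shifts]

print(median_risks)    # essentially identical across shifts
print(clipped_risks)   # varies strongly with the shift
```

Using common noise draws isolates the effect of the shift: any difference in the median's risk across $\xi$ would have to come from a failure of (1.18), not from simulation error.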

Theorem 1.4 has an important consequence. Since the risk of any equivariant
estimator is independent of $\xi$, the problem of uniformly minimizing the risk within
this class of estimators is replaced by the much simpler problem of determining
the equivariant estimator for which this constant risk is smallest.

Definition 1.5 In a location invariant estimation problem, if a location equivariant 
estimator exists which minimizes the constant risk, it is called the minimum risk 
equivariant (MRE) estimator. 

Such an estimator will typically exist, and is often unique, although in rare 
cases there could be a sequence of estimators whose risks decrease to a value not 
assumed. To derive an explicit expression for the MRE estimator, let us begin by 
finding a representation of the most general location equivariant estimator. 

Lemma 1.6 If $\delta_0$ is any equivariant estimator, then a necessary and sufficient
condition for $\delta$ to be equivariant is that

(1.19) $\delta(x) = \delta_0(x) + u(x)$

where $u(x)$ is any function satisfying

(1.20) $u(x + a) = u(x)$, for all x, a.

¹ Some authors have called such estimators invariant, which could suggest that the estimator remains
unchanged, rather than changing in a prescribed way. We will reserve that term for functions that
do remain unchanged, such as those satisfying (1.20).

Proof. Assume first that (1.19) and (1.20) hold. Then,

$\delta(x + a) = \delta_0(x + a) + u(x + a) = \delta_0(x) + a + u(x) = \delta(x) + a$,

so that $\delta$ is equivariant.

Conversely, if $\delta$ is equivariant, let

$u(x) = \delta(x) - \delta_0(x)$.

Then

$u(x + a) = \delta(x + a) - \delta_0(x + a) = \delta(x) + a - \delta_0(x) - a = u(x)$,

so that (1.19) and (1.20) hold. □

To complete the representation, we need a characterization of the functions u 
satisfying (1.20). 

Lemma 1.7 A function u satisfies (1.20) if and only if it is a function of the
differences $y_i = x_i - x_n$ $(i = 1, \ldots, n - 1)$, $n \ge 2$; for $n = 1$, if and only if it is a
constant.

Proof. The proof is essentially the same as that of (1.15). □ 

Note that the function $u(\cdot)$, which is invariant, is only a function of the ancillary
statistic $(y_1, \ldots, y_{n-1})$ (see Section 1.6). Hence, by itself, it does not carry any
information about the parameter $\xi$. The connection between invariance and
ancillarity is not coincidental. (See Lehmann and Scholz 1992, and Problems 2.11 and
2.12.)

Combining Lemmas 1.6 and 1.7 gives the following characterization of equivariant
estimators.

Theorem 1.8 If $\delta_0$ is any equivariant estimator, then a necessary and sufficient
condition for $\delta$ to be equivariant is that there exists a function v of $n - 1$ arguments
for which

(1.21) $\delta(x) = \delta_0(x) - v(y)$ for all x.

Example 1.9 Location equivariant estimators based on one observation. Consider
the case $n = 1$. Then, it follows from Theorem 1.8 that the only equivariant
estimators are $X + c$ for some constant c. ||

We are now in a position to determine the equivariant estimator with minimum 
risk. 

Theorem 1.10 Let $X = (X_1, \ldots, X_n)$ be distributed according to (1.9), let
$Y_i = X_i - X_n$ $(i = 1, \ldots, n - 1)$ and $Y = (Y_1, \ldots, Y_{n-1})$. Suppose that the loss function
is given by (1.15) and that there exists an equivariant estimator $\delta_0$ of $\xi$ with finite
risk. Assume that for each y there exists a number $v(y) = v^*(y)$ which minimizes

(1.22) $E_0\{\rho[\delta_0(X) - v(y)] \mid y\}$.

Then, a location equivariant estimator $\delta$ of $\xi$ with minimum risk exists and is given
by

$\delta^*(X) = \delta_0(X) - v^*(Y)$.

Proof. By Theorem 1.8, the MRE estimator is found by determining v so as to
minimize

$R_\xi(\delta) = E_\xi\{\rho[\delta_0(X) - v(Y) - \xi]\}$.

Since the risk is independent of $\xi$, it suffices to minimize

$R_0(\delta) = E_0\{\rho[\delta_0(X) - v(Y)]\} = \int E_0\{\rho[\delta_0(X) - v(y)] \mid y\}\,dP_0(y)$.

The integral is minimized by minimizing the integrand, and hence (1.22), for each
y. Since $\delta_0$ has finite risk, $E_0\{\rho[\delta_0(X)] \mid y\} < \infty$ (a.e. $P_0$), the minimization of
(1.22) is meaningful. The result now follows from the assumptions of the theorem. □

Corollary 1.11 Under the assumptions of Theorem 1.10, suppose that $\rho$ is convex
and not monotone. Then, an MRE estimator of $\xi$ exists; it is unique if $\rho$ is strictly
convex.

Proof. Theorems 1.10 and 1.7.15. □ 

Corollary 1.12 Under the assumptions of Theorem 1.10:

(i) if $\rho(d - \xi) = (d - \xi)^2$, then

(1.23) $v^*(y) = E_0[\delta_0(X) \mid y]$;

(ii) if $\rho(d - \xi) = |d - \xi|$, then $v^*(y)$ is any median of $\delta_0(X)$ under the conditional
distribution of X given y.

Proof. Examples 1.7.17 and 1.7.18 □ 

Example 1.13 Continuation of Example 1.9. For the case $n = 1$, if X has finite
risk, the arguments of Theorem 1.10 and Corollary 1.11 show that the MRE
estimator is $X - v^*$ where $v^*$ is any value minimizing

(1.24) $E_0[\rho(X - v)]$.

In particular, the MRE estimator is $X - E_0(X)$ and $X - \mathrm{med}_0(X)$ when the loss is
squared error and absolute error, respectively.

Suppose, now, that X is symmetrically distributed about $\xi$. Then, for any $\rho$ which
is convex and even, it follows from Corollary 1.7.19 that (1.24) is minimized by
$v = 0$, so that X is MRE. Under the same assumptions, if $n = 2$, the MRE estimator
is $(X_1 + X_2)/2$ (Problem 1.3). ||

The existence of MRE estimators is, of course, not restricted to convex loss 
functions. As an important class of nonconvex loss functions, consider the case 
that p is bounded. 

Corollary 1.14 Under the assumptions of Example 1.13, suppose that $0 \le \rho(t) \le M$
for all values of t, that $\rho(t) \to M$ as $t \to \pm\infty$, and that the density f of X is
continuous a.e. Then, an MRE estimator of $\xi$ exists.

Proof. See Problem 1.8. □

Example 1.15 MRE under 0-1 loss. Suppose that

$\rho(d - \xi) = \begin{cases} 1 & \text{if } |d - \xi| > k \\ 0 & \text{otherwise.} \end{cases}$

Then, v will minimize (1.24), provided it maximizes

(1.25) $P_0\{|X - v| \le k\}$.

Suppose that the density f is symmetric about 0. If f is unimodal, then $v = 0$
and the MRE estimator of $\xi$ is X. On the other hand, suppose that f is U-shaped,
say $f(x)$ is zero for $|x| > c > k$ and is strictly increasing for $0 < x < c$. Then,
there are two values of v maximizing (1.25), namely $v = c - k$ and $v = -c + k$;
hence, $X - c + k$ and $X + c - k$ are both MRE. ||

Example 1.16 Normal. Let $X_1, \ldots, X_n$ be iid according to $N(\xi, \sigma^2)$, where $\sigma$
is known. If $\delta_0 = \bar X$ in Theorem 1.10, it follows from Basu's theorem that $\delta_0$ is
independent of Y and hence that $v(y) = v$ is a constant determined by minimizing
(1.24) with $\bar X$ in place of X. Thus $\bar X$ is MRE for all convex and even $\rho$. It is also
MRE for many nonconvex loss functions including that of Example 1.15. ||


This example has an interesting implication concerning a “least favorable” property
of the normal distribution.

Theorem 1.17 Let $\mathcal{F}$ be the class of all univariate distributions F that have a
density f (w.r.t. Lebesgue measure) and fixed finite variance, say $\sigma^2 = 1$. Let
$X_1, \ldots, X_n$ be iid with density $f(x_i - \xi)$, $\xi = E(X_i)$, and let $r_n(F)$ be the risk of
the MRE estimator of $\xi$ with squared error loss. Then, $r_n(F)$ takes on its maximum
value over $\mathcal{F}$ when F is normal.

Proof. The MRE estimator in the normal case is $\bar X$ with risk $E(\bar X - \xi)^2 = 1/n$.
Since this is the risk of $\bar X$, regardless of F, the MRE estimator for any other F
must have risk $\le 1/n$, and this completes the proof. □

For $n \ge 3$, the normal distribution is, in fact, the only one for which $r_n(F) = 1/n$.
Since the MRE estimator is unique, this will follow if the normal distribution can
be shown to be the only one whose MRE estimator is $\bar X$. From Corollary 1.12, it
is seen that the MRE estimator is $\bar X - E_0[\bar X \mid Y]$ and, hence, is $\bar X$ if and only if
$E_0[\bar X \mid Y] = 0$. It was proved by Kagan, Linnik, and Rao (1965, 1973) that this last
equation holds if and only if F is normal.
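
As a concrete comparison, one can set the normal risk $1/n$ against the risk of the MRE estimator for a uniform distribution with unit variance, which is the midrange (Example 1.19). The sketch below relies on a standard fact not stated in the text, namely that the midrange of n iid $U(\xi - b/2, \xi + b/2)$ variables has variance $b^2/[2(n+1)(n+2)]$, with $b^2 = 12$ for unit variance. The function names are illustrative.

```python
from fractions import Fraction

def normal_mre_risk(n):
    """Risk of the sample mean under unit variance: 1/n."""
    return Fraction(1, n)

def uniform_mre_risk(n):
    """Risk of the midrange for a uniform distribution with unit variance.

    Assumes Var(midrange) = b^2 / (2 (n+1) (n+2)) with b^2 = 12.
    """
    return Fraction(12, 2 * (n + 1) * (n + 2))

for n in (1, 2, 3, 5, 10):
    print(n, normal_mre_risk(n), uniform_mre_risk(n))
```

Under these assumptions the two risks coincide for $n = 1$ and $n = 2$ and the uniform risk is strictly smaller for $n \ge 3$, in line with the remark that the normal distribution is the unique maximizer only for $n \ge 3$.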

Example 1.18 Exponential. Let $X_1, \ldots, X_n$ be iid according to the exponential
distribution $E(\xi, b)$ with b known. If $\delta_0 = X_{(1)}$ in Theorem 1.10, it again follows
from Basu's theorem that $\delta_0$ is independent of Y and hence that $v(y) = v$ is
determined by minimizing

(1.26) $E_0[\rho(X_{(1)} - v)]$.

(a) If the loss is squared error, the minimizing value is $v = E_0[X_{(1)}] = b/n$, and
hence the MRE estimator is $X_{(1)} - (b/n)$.

(b) If the loss is absolute error, the minimizing value is $v = b(\log 2)/n$ (Problem
1.4).

(c) If the loss function is that of Example 1.15, then v is the center of the interval
I of length 2k which maximizes $P_{\xi=0}[X_{(1)} \in I]$. Since for $\xi = 0$, the density of
$X_{(1)}$ is decreasing on $(0, \infty)$, $v = k$, and the MRE estimator is $X_{(1)} - k$.

See Problem 1.5 for another comparison. ||
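
The three corrections in (a)-(c) can be collected in a few lines. This is a sketch under the assumption that $E(\xi, b)$ has density $(1/b)e^{-(x-\xi)/b}$ for $x \ge \xi$, so that for $\xi = 0$ the minimum $X_{(1)}$ is exponential with scale $b/n$; the function names and the loss labels are hypothetical.

```python
import math

def min_cdf(t, n, b):
    """CDF of X_(1) when xi = 0, assuming X_(1) is exponential with scale b/n."""
    return 1.0 - math.exp(-n * t / b) if t > 0 else 0.0

def mre_shift(n, b, loss, k=None):
    """The constant v subtracted from X_(1) under each loss of Example 1.18."""
    if loss == "squared":      # (a): v is the mean of X_(1)
        return b / n
    if loss == "absolute":     # (b): v is the median of X_(1)
        return b * math.log(2.0) / n
    if loss == "zero_one":     # (c): v = k, with k the half-width from Example 1.15
        return k
    raise ValueError("unknown loss: " + loss)

def mre_exponential(xs, b, loss, k=None):
    """MRE estimate X_(1) - v for a sample xs."""
    return min(xs) - mre_shift(len(xs), b, loss, k)

def coverage(v, k, n, b):
    """P_0(|X_(1) - v| <= k), the quantity (1.25) with X_(1) in place of X."""
    return min_cdf(v + k, n, b) - min_cdf(v - k, n, b)

print(mre_exponential([1.2, 0.7, 3.1], 2.0, "squared"))
print(mre_exponential([1.2, 0.7, 3.1], 2.0, "absolute"))
```

A grid search over `coverage` gives a quick numerical check that $v = k$ maximizes the coverage probability in case (c).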

Example 1.19 Uniform. Let $X_1, \ldots, X_n$ be iid according to the uniform distribution
$U(\xi - b/2, \xi + b/2)$, with b known, and suppose the loss function $\rho$ is
convex and even. For $\delta_0$, take $[X_{(1)} + X_{(n)}]/2$ where $X_{(1)} < \cdots < X_{(n)}$ denote the
ordered X's. To find $v(y)$ minimizing (1.22), consider the conditional distribution of
$\delta_0$ given y. This distribution depends on y only through the differences $X_{(i)} - X_{(1)}$,
$i = 2, \ldots, n$. By Basu's theorem, the pair $(X_{(1)}, X_{(n)})$ is independent of the ratios
$Z_i = [X_{(i)} - X_{(1)}]/[X_{(n)} - X_{(1)}]$, $i = 2, \ldots, n - 1$ (Problem 1.6.36(b)). Therefore,
the conditional distribution of $\delta_0$ given the differences $X_{(i)} - X_{(1)}$, which is equivalent
to the conditional distribution of $\delta_0$ given $X_{(n)} - X_{(1)}$ and the Z's, depends only
on $X_{(n)} - X_{(1)}$. However, the conditional distribution of $\delta_0$ given $V = X_{(n)} - X_{(1)}$ is
symmetric about 0 (when $\xi = 0$; Problem 1.2). It follows, therefore, as in Example
1.13 that the MRE estimator of $\xi$ is $[X_{(1)} + X_{(n)}]/2$, the midrange. ||

When loss is squared error, the MRE estimator

(1.27) $\delta^*(X) = \delta_0(X) - E_0[\delta_0(X) \mid Y]$

can be evaluated more explicitly.

Theorem 1.20 Under the assumptions of Theorem 1.10, with $L(\xi, d) = (d - \xi)^2$,
the estimator (1.27) is given by

(1.28) $\delta^*(x) = \dfrac{\int_{-\infty}^{\infty} u f(x_1 - u, \ldots, x_n - u)\,du}{\int_{-\infty}^{\infty} f(x_1 - u, \ldots, x_n - u)\,du}$

and in this form, it is known as the Pitman estimator of $\xi$.

Proof. Let $\delta_0(X) = X_n$. To compute $E_0(X_n \mid y)$ (which exists by Problem 1.21),
make the change of variables

$y_i = x_i - x_n$ $(i = 1, \ldots, n - 1)$; $y_n = x_n$.

The Jacobian of the transformation is 1. The joint density of the Y's is therefore

$p_Y(y_1, \ldots, y_n) = f(y_1 + y_n, \ldots, y_{n-1} + y_n, y_n)$,

and the conditional density of $Y_n$ given $y = (y_1, \ldots, y_{n-1})$ is

$\dfrac{f(y_1 + y_n, \ldots, y_{n-1} + y_n, y_n)}{\int f(y_1 + t, \ldots, y_{n-1} + t, t)\,dt}$.

It follows that

$E_0[X_n \mid y] = E_0[Y_n \mid y] = \dfrac{\int t f(y_1 + t, \ldots, y_{n-1} + t, t)\,dt}{\int f(y_1 + t, \ldots, y_{n-1} + t, t)\,dt}$.

This can be reexpressed in terms of the x's as

$E_0[X_n \mid y] = \dfrac{\int t f(x_1 - x_n + t, \ldots, x_{n-1} - x_n + t, t)\,dt}{\int f(x_1 - x_n + t, \ldots, x_{n-1} - x_n + t, t)\,dt}$

or, finally, by making the change of variables $u = x_n - t$ as

$E_0[X_n \mid y] = x_n - \dfrac{\int u f(x_1 - u, \ldots, x_n - u)\,du}{\int f(x_1 - u, \ldots, x_n - u)\,du}$.

This completes the proof. □
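
Formula (1.28) lends itself to direct numerical evaluation. The following is a rough sketch, not a recommended algorithm: it assumes iid observations, so that the joint density is a product of a marginal density, and replaces the infinite integrals by a trapezoid rule on a finite window. The names `pitman_estimate`, `normal_density`, and `uniform_density`, the window width, and the step count are all illustrative choices.

```python
import math

def normal_density(t):
    """Standard normal density (sigma = 1 assumed)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def uniform_density(b):
    """Density of U(-b/2, b/2), returned as a function of t."""
    return lambda t: 1.0 / b if abs(t) < b / 2.0 else 0.0

def pitman_estimate(xs, density, half_width=10.0, steps=20001):
    """Approximate (1.28) for an iid sample xs with marginal density `density`.

    The integrals over u are truncated to [min(xs) - half_width,
    max(xs) + half_width]; this is adequate only if the joint density is
    negligible outside that window. Raises ZeroDivisionError if the joint
    density vanishes at every grid point.
    """
    lo = min(xs) - half_width
    hi = max(xs) + half_width
    h = (hi - lo) / (steps - 1)
    num = den = 0.0
    for j in range(steps):
        u = lo + j * h
        w = 0.5 if j in (0, steps - 1) else 1.0  # trapezoid end weights
        joint = math.prod(density(x - u) for x in xs)
        num += w * u * joint
        den += w * joint
    return num / den  # the common factor h cancels

xs = [0.3, 1.1, 2.0, -0.4]
print(pitman_estimate(xs, normal_density))        # compare with the sample mean
print(pitman_estimate(xs, uniform_density(4.0)))  # compare with the midrange
```

For the normal density the result should agree with the sample mean (Example 1.16) and for the uniform density with the midrange (Example 1.19), the latter only up to the grid spacing because the integrand is discontinuous.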

Example 1.21 (Continuation of Example 1.19). As an illustration of (1.28), let
us apply it to the situation of Example 1.19. Then

$f(x_1 - \xi, \ldots, x_n - \xi) = \begin{cases} b^{-n} & \text{if } \xi - \frac{b}{2} \le x_{(1)} \le x_{(n)} \le \xi + \frac{b}{2} \\ 0 & \text{otherwise} \end{cases}$

where b is known. The Pitman estimator is therefore given by

$\delta^*(x) = \left( \int_{x_{(n)} - b/2}^{x_{(1)} + b/2} u\,du \right) \left( \int_{x_{(n)} - b/2}^{x_{(1)} + b/2} du \right)^{-1} = \frac{1}{2}[x_{(1)} + x_{(n)}]$,

which agrees with the result of Example 1.19. ||

For most densities, the integrals in (1.28) are difficult to evaluate. The following 
example illustrates the MRE estimator for one more case. 

Example 1.22 Double exponential. Let $X_1, \ldots, X_n$ be iid with double exponential
distribution $DE(\xi, 1)$, so that their joint density is $(1/2^n) \exp(-\sum |x_i - \xi|)$.
It is enough to evaluate the integrals in (1.28) over the set where $x_1 < \cdots < x_n$. If
$x_k < \xi < x_{k+1}$,

$\sum_{i=1}^{n} |x_i - \xi| = \sum_{i=k+1}^{n} (x_i - \xi) - \sum_{i=1}^{k} (x_i - \xi) = \sum_{i=k+1}^{n} x_i - \sum_{i=1}^{k} x_i + (2k - n)\xi$.

The integration then leads to two sums, both in numerator and denominator of the
Pitman estimator. The resulting expression is the desired estimator. ||
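
The piecewise structure described in this example can be turned into an explicit computation. The sketch below is one possible arrangement, not the book's formula: on each interval between consecutive order statistics the joint density is proportional to $e^{cu}$ with $c = n - 2k$, and the antiderivatives $e^{cu}/c$ and $e^{cu}(u/c - 1/c^2)$ are evaluated at the finite endpoints. The function name `laplace_pitman` is hypothetical, and the constant $1/2^n$ is dropped because it cancels in (1.28).

```python
import math

def laplace_pitman(xs):
    """Pitman estimator (1.28) for iid DE(xi, 1) data, integrated piece by piece."""
    x = sorted(xs)
    n = len(x)

    def dens(u):
        # Joint density at location u, without the factor 1/2^n.
        return math.exp(-sum(abs(xi - u) for xi in x))

    num = den = 0.0
    # Piece k lies between x_(k) and x_(k+1); k observations are below it.
    # Piece 0 extends to -infinity and piece n to +infinity.
    for k in range(n + 1):
        c = n - 2 * k                      # density on this piece is const * exp(c * u)
        lo = x[k - 1] if k > 0 else None   # None marks an infinite endpoint
        hi = x[k] if k < n else None
        if c == 0:
            g = dens(lo)                   # density is constant on this piece
            den += g * (hi - lo)
            num += g * (hi * hi - lo * lo) / 2.0
            continue
        # The antiderivatives vanish at the infinite endpoint, so only the
        # finite endpoints contribute.
        for end, sign in ((hi, 1.0), (lo, -1.0)):
            if end is None:
                continue
            g = dens(end)
            den += sign * g / c
            num += sign * g * (end / c - 1.0 / c ** 2)
    return num / den

print(laplace_pitman([0.4, 1.9, -0.7, 2.2, 0.1]))
```

Evaluating the density at the endpoints, rather than carrying the factor $e^{cu}$ separately, keeps all exponentials bounded by one.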


So far, the estimator $\delta$ has been assumed to be nonrandomized. Let us now consider
the role of randomized estimators for equivariant estimation. Recall from the
proof of Corollary 1.7.9 that a randomized estimator can be obtained as a
nonrandomized estimator $\delta(X, W)$ depending on X and an independent random variable
W with known distribution. For such an estimator, the equivariance condition
(1.18) becomes

$\delta(X + a, W) = \delta(X, W) + a$ for all a.

There is no change in Theorem 1.4, and Lemma 1.6 remains valid with (1.20)
replaced by

$u(x + a, w) = u(x, w)$ for all x, w, and a.

The proof of Lemma 1.7 shows that this condition holds if and only if u is a
function only of y and w, so that, finally, in generalization of (1.21), an estimator
$\delta(X, W)$ is equivariant if and only if it is of the form

(1.29) $\delta(X, W) = \delta_0(X, W) - v(Y, W)$.

Applying the proof of Theorem 1.10 to (1.29), we see that the risk is minimized
by choosing for $v(y, w)$ the function minimizing

$E_0\{\rho[\delta_0(X, w) - v(y, w)] \mid y, w\}$.

Since the starting $\delta_0$ can be any equivariant estimator, let it be nonrandomized,
that is, not dependent on W. Since X and W are independent, it then follows that
the minimizing $v(y, w)$ will not involve w, so that the MRE estimator (if it exists)
will be nonrandomized.

Suppose now that T is a sufficient statistic for $\xi$. Then, X can be represented
as $(T, W)$, where W has a known distribution (see Section 1.6), and any estimator
$\delta(X)$ can be viewed as a randomized estimator based on T. The above argument
then suggests that an MRE estimator can always be chosen to depend on T only.
However, the argument does not apply since the family $\{P_\xi^T, -\infty < \xi < \infty\}$
need no longer be a location family. Let us therefore add the assumption that
$T = (T_1, \ldots, T_r)$ where $T_j = T_j(X)$ are real-valued and equivariant, that is, satisfy

(1.30) $T_j(x + a) = T_j(x) + a$ for all x and a.

Under this assumption, the distributions of T do constitute a location family. To
see this, let $V = X - \xi$ so that V is distributed with density $f(v_1, \ldots, v_n)$. Then,
$T_j(X) = T_j(V + \xi) = T_j(V) + \xi$, and this defines a location family. The earlier
argument therefore applies, and under assumption (1.30), an MRE estimator can
be found which depends only on T. (For a general discussion of the relationship of
invariance and sufficiency, see Hall, Wijsman, and Ghosh 1965, Basu 1969, Berk
1972a, Landers and Rogge 1973, Arnold 1985, Kariya 1989, and Ramamoorthi
1990.)

In Examples 1.16, 1.18, and 1.19, the sufficient statistics $\bar X$, $X_{(1)}$, and $(X_{(1)}, X_{(n)})$,
respectively, satisfy (1.30), and the previous remark provides an alternative derivation
for the MRE estimators in these examples.

It is interesting to compare the results of the present section with those on 
unbiased estimation in Chapter 2. It was found there that when a UMVU estimator 
exists, it typically minimizes the risk for all convex loss functions, but that for 
bounded loss functions not even a locally minimum risk unbiased estimator can 
be expected to exist. In contrast: 

(a) An MRE estimator typically exists not only for convex loss functions but even 
when the loss function is not so restricted. 

(b) On the other hand, even for convex loss functions, the MRE estimator often 
varies with the loss function. 

(c) Randomized estimators need not be considered in equivariant estimation since 
there are always uniformly better nonrandomized ones. 

(d) Unlike UMVU estimators which are frequently inadmissible, the Pitman
estimator is admissible under mild assumptions (Stein 1959, and Section 5.4).

(e) The principal area of application of UMVU estimation is that of exponential
families, and these have little overlap with location families (see Section 1.5).

(f) For location families, UMVU estimators typically do not exist. (For specific 
results in this direction, see Bondesson 1975.) 

Let us next consider whether MRE estimators are unbiased. 

Lemma 1.23 Let the loss function be squared error. 

(a) When $\delta(X)$ is any equivariant estimator with constant bias b, then $\delta(X) - b$
is equivariant, unbiased, and has smaller risk than $\delta(X)$.

(b) The unique MRE estimator is unbiased. 

(c) If a UMVU estimator exists and is equivariant, it is MRE. 

Proof. Part (a) follows from Lemma 2.2.7; (b) and (c) are immediate consequences 
of (a). □ 

That an MRE estimator need not be unbiased for general loss functions is seen
from Example 1.18 with absolute error as loss. Some light is thrown on the possible
failure of MRE estimators to be unbiased by considering the following
decision-theoretic definition of unbiasedness, which depends on the loss function L.

Definition 1.24 An estimator $\delta$ of $g(\theta)$ is said to be risk-unbiased if it satisfies

(1.31) $E_\theta L[\theta, \delta(X)] \le E_\theta L[\theta', \delta(X)]$ for all $\theta' \ne \theta$.

If one interprets $L(\theta, d)$ as measuring how far the estimated value d is from the
estimand $g(\theta)$, then (1.31) states that, on the average, $\delta$ is at least as close to the
true value $g(\theta)$ as it is to any false value $g(\theta')$.

Example 1.25 Mean-unbiasedness. If the loss function is squared error, (1.31)
becomes

(1.32) $E_\theta[\delta(X) - g(\theta')]^2 \ge E_\theta[\delta(X) - g(\theta)]^2$ for all $\theta' \ne \theta$.

Suppose that $E_\theta(\delta^2) < \infty$ and that $E_\theta(\delta) \in \Omega_g$ for all $\theta$, where
$\Omega_g = \{g(\theta) : \theta \in \Omega\}$. [The latter condition is, of course, automatically satisfied when
$\Omega = (-\infty, \infty)$ and $g(\theta) = \theta$, as is the case when $\theta$ is a location parameter.] Then, the left side
of (1.32) is minimized by $g(\theta') = E_\theta \delta(X)$ (Example 1.7.17) and the condition of
risk-unbiasedness, therefore, reduces to the usual unbiasedness condition

(1.33) $E_\theta \delta(X) = g(\theta)$.

Example 1.26 Median-unbiasedness. If the loss function is absolute error, (1.31)
becomes

(1.34) $E_\theta|\delta(X) - g(\theta')| \ge E_\theta|\delta(X) - g(\theta)|$ for all $\theta' \ne \theta$.

By Example 1.7.18, the left side of (1.34) is minimized by any median of $\delta(X)$. It
follows that (1.34) reduces to the condition

(1.35) $\mathrm{med}_\theta\, \delta(X) = g(\theta)$,

that is, $g(\theta)$ is a median of $\delta(X)$, provided $E_\theta|\delta| < \infty$ and $\Omega_g$ contains a median
of $\delta(X)$ for all $\theta$. An estimator $\delta$ satisfying (1.35) is called median-unbiased. ||

Theorem 1.27 If $\delta$ is MRE for estimating $\xi$ in model (1.9) with loss function
(1.15), then it is risk-unbiased.

Proof. Condition (1.31) now becomes

$E_\xi \rho[\delta(X) - \xi'] \ge E_\xi \rho[\delta(X) - \xi]$ for all $\xi' \ne \xi$,

or, if without loss of generality we put $\xi = 0$,

$E_0 \rho[\delta(X) - a] \ge E_0 \rho[\delta(X)]$ for all a.

That this holds is an immediate consequence of the fact that $\delta(X) = \delta_0(X) - v^*(Y)$
where $v^*(y)$ minimizes (1.22). □

2 The Principle of Equivariance 

In the present section, we shall extend the invariance considerations of the binomial
situation of Example 1.1 and the location families (1.9) to the general
situation in which the probability model remains invariant under a suitable group
of transformations.

Let X be a random observable taking on values in a sample space $\mathcal{X}$ according
to a probability distribution from the family

(2.1) $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$.

Denote by $\mathcal{C}$ a class of 1:1 transformations g of the sample space onto itself.

Definition 2.1

(i) If g is a 1:1 transformation of the sample space onto itself, if for each $\theta$
the distribution of $X' = gX$ is again a member of $\mathcal{P}$, say $P_{\theta'}$, and if as $\theta$
traverses $\Omega$, so does $\theta'$, then the probability model (2.1) is invariant under the
transformation g.

(ii) If (i) holds for each member of a class of transformations $\mathcal{C}$, then the model
(2.1) is invariant under $\mathcal{C}$.

A class of transformations that leave a probability model invariant can always
be assumed to be a group. To see this, let $G = G(\mathcal{C})$ be the set of all compositions
(defined in Section 1.4) of a finite number of transformations $g_1^{\pm 1} \cdots g_m^{\pm 1}$ with
$g_1, \ldots, g_m \in \mathcal{C}$, where each of the exponents can be +1 or −1 and where the
elements $g_1, \ldots, g_m$ need not be distinct. Then, any element $g \in G$ leaves (2.1)
invariant, and G is a group (Problem 2.1), the group generated by $\mathcal{C}$.

Example 2.2 Location family.

(a) Consider the location family (1.9) and the group of transformations $X' = X + a$,
which was already discussed in (1.10) and Example 4.1. It is seen from (1.12)
that if X is distributed according to (1.9) with $\theta = \xi$, then $X' = X + a$ has the
density (1.9) with $\theta' = \xi' = \xi + a$, so that the model (1.9) is preserved under
these transformations.

(b) Suppose now that, in addition, f has the symmetry property

(2.2) $f(-x) = f(x)$

where $-x = (-x_1, \ldots, -x_n)$, and consider the transformation $x' = -x$. The
density of X' is

$f(-x' - \xi) = f(x'_1 - \xi', \ldots, x'_n - \xi')$

if $\xi' = -\xi$. Thus, model (1.9) is invariant under the transformations $x' = -x$,
$\xi' = -\xi$, and hence under the group consisting of this transformation and the
identity (Problem 2.2). This is not true, however, if f does not satisfy (2.2).
If, for example, $X_1, \ldots, X_n$ are iid according to the exponential distribution
$E(\xi, 1)$, then the variables $-X_1, \ldots, -X_n$ no longer have an exponential
distribution. ||

Let $\{gX, g \in G\}$ be a group of transformations of the sample space which leave
the model invariant. If gX has the distribution $P_{\theta'}$, then $\theta' = \bar g\theta$ is a function which
maps $\Omega$ onto $\Omega$, and the transformation $\bar g\theta$ is 1:1, provided the distributions $P_\theta$,
$\theta \in \Omega$, are distinct (Problem 2.3). It is easy to see that the transformations $\bar g$ then
also form a group which will be denoted by $\bar G$ (Problem 2.4). From the definition
of $\bar g\theta$, it follows that

(2.3) $P_\theta(gX \in A) = P_{\bar g\theta}(X \in A)$

where the subscript on the left side indicates the distribution of X, not that of gX.
More generally, for a function $\psi$ whose expectation is defined,

(2.4) $E_\theta[\psi(gX)] = E_{\bar g\theta}[\psi(X)]$.

We have now generalized the transformations (1.10) and (1.11), and it remains to 
consider (1.13). This last generalization is most easily introduced by an example. 

Example 2.3 Two-sample location family. Let $X = (X_1, \ldots, X_m)$ and
$Y = (Y_1, \ldots, Y_n)$ and suppose that (X, Y) has the joint density

(2.5) $f(x - \xi, y - \eta) = f(x_1 - \xi, \ldots, x_m - \xi, y_1 - \eta, \ldots, y_n - \eta)$.

This model remains invariant under the transformations

(2.6) $g(x, y) = (x + a, y + b)$, $\bar g(\xi, \eta) = (\xi + a, \eta + b)$.

Consider the problem of estimating

(2.7) $\Delta = \eta - \xi$.

If the transformed variables are denoted by

$x' = x + a$, $y' = y + b$, $\xi' = \xi + a$, $\eta' = \eta + b$,

then $\Delta$ is transformed into $\Delta' = \Delta + (b - a)$. Hence, an estimated value d, when
expressed in the new coordinates, becomes

(2.8) $d' = d + (b - a)$.

For the problem to remain invariant, we require, analogously to (1.14), that the
loss function $L(\xi, \eta; d)$ satisfies

(2.9) $L[\xi + a, \eta + b; d + (b - a)] = L(\xi, \eta; d)$.

It is easy to see (Problem 2.5) that this is the case if and only if L depends only on
the difference $(\eta - \xi) - d$, that is, if

(2.10) $L(\xi, \eta; d) = \rho(\Delta - d)$.

Suppose, next, that instead of estimating $\eta - \xi$, the problem is that of estimating

$h(\xi, \eta) = \xi^2 + \eta^2$.

Under the transformations (2.6), $h(\xi, \eta)$ is transformed into $(\xi + a)^2 + (\eta + b)^2$. This
does not lead to an analog of (2.8) since the transformed value does not depend on
$(\xi, \eta)$ only through $h(\xi, \eta)$. Thus, the form of the function to be estimated plays a
crucial role in invariance considerations.
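
The contrast between the two estimands can be seen on a pair of parameter points. In the sketch below (function names and numerical values are illustrative only), two points with the same value of $\eta - \xi$ still share a common value after the shift (2.6), whereas two points with the same value of $\xi^2 + \eta^2$ do not.

```python
def shift(theta, a, b):
    """The induced transformation in (2.6) acting on theta = (xi, eta)."""
    xi, eta = theta
    return (xi + a, eta + b)

def difference(theta):
    """The estimand (2.7): eta - xi."""
    xi, eta = theta
    return eta - xi

def sum_of_squares(theta):
    """The second estimand considered in the example: xi^2 + eta^2."""
    xi, eta = theta
    return xi ** 2 + eta ** 2

a, b = 1.0, 0.0
t1, t2 = (1.0, 3.0), (0.0, 2.0)   # same value of eta - xi
s1, s2 = (1.0, 0.0), (0.0, 1.0)   # same value of xi^2 + eta^2

print(difference(shift(t1, a, b)), difference(shift(t2, a, b)))
print(sum_of_squares(shift(s1, a, b)), sum_of_squares(shift(s2, a, b)))
```

A single such pair is enough to show that the transformed value of $\xi^2 + \eta^2$ is not determined by its original value.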


Now, consider the general problem of estimating $h(\theta)$ in model (2.1), which is
assumed to be invariant under the transformations $X' = gX$, $\theta' = \bar g\theta$, $g \in G$. The
additional assumption required is that for any given $\bar g$, $h(\bar g\theta)$ depends on $\theta$ only
through $h(\theta)$, that is,

(2.11) $h(\theta_1) = h(\theta_2)$ implies $h(\bar g\theta_1) = h(\bar g\theta_2)$.

The common value of $h(\bar g\theta)$ for all $\theta$'s to which h assigns the same value will then
be denoted by

(2.12) $h(\bar g\theta) = g^*h(\theta)$.

If $\mathcal{H}$ is the set of values taken on by $h(\theta)$ as $\theta$ ranges over $\Omega$, the transformations
$g^*$ are 1:1 from $\mathcal{H}$ onto itself [Problem 2.8(a)]. As $\bar g$ ranges over $\bar G$, the
transformations $g^*$ form a group $G^*$ (Problem 2.6).

The estimated value d of $h(\theta)$ when expressed in the new coordinates becomes

(2.13) $d' = g^*d$.

Since the problems of estimating $h(\theta)$ in terms of $(X, \theta, d)$ or $h(\theta')$ in terms of
$(X', \theta', d')$ represent the same physical situation expressed in a new coordinate
system, the loss function should satisfy $L(\theta', d') = L(\theta, d)$.

This leads to the following definition. 

Definition 2.4 If the probability model (2.1) is invariant under g, the loss function
L satisfies

(2.14) $L(\bar g\theta, g^*d) = L(\theta, d)$,

and $h(\theta)$ satisfies (2.11), the problem of estimating $h(\theta)$ with loss function L is
invariant under g.

In this discussion, it was tacitly assumed that the set $\mathcal{D}$ of possible decisions
coincides with $\mathcal{H}$. This need not, however, be the case. In Chapter 2, for example,
estimators of a variance were permitted (with some misgiving) to take on negative
values. In the more general case that $\mathcal{H}$ is a subset of $\mathcal{D}$, one can take the condition
that (2.14) holds for all $\theta$ as the definition of $g^*d$. If $L(\theta, d) = L(\theta, d')$ for all
$\theta$ implies $d = d'$, as is typically the case, $g^*d$ is uniquely defined by the above
condition, and $g^*$ is 1:1 from $\mathcal{D}$ onto itself [Problem 2.8(b)].

In an invariant estimation problem, if $\delta$ is the estimator that we would like to
use to estimate $h(\theta)$, there are two natural ways of estimating $g^*h(\theta)$, the estimand
$h(\theta)$ expressed in the transformed system. One of these generalizes the estimators
(1.6) and (1.16), and the other the estimators (1.6) and (1.17) of the preceding
section.

1. Functional Equivariance. Quite generally, if we have decided to use $\delta(X)$ to
estimate $h(\theta)$, it is natural to use

$\phi[\delta(X)]$ as the estimator of $\phi[h(\theta)]$,

for any function $\phi$. If, for example, $\delta(X)$ is used to estimate the length $\theta$ of the edge
of a cube, it is natural to estimate the volume $\theta^3$ of the cube by $[\delta(X)]^3$. Hence, if
d is the estimated value of $h(\theta)$, then $g^*d$ should be the estimated value of $g^*h(\theta)$.
Applying this to $\phi = g^*$ leads to

(2.15) $g^*\delta(X)$ as the estimator of $g^*h(\theta)$

when $\delta(X)$ is used to estimate $h(\theta)$.

2. Formal Invariance. Invariance under transformations g, $\bar g$, and $g^*$ of the
estimation of $h(\theta)$ means that the problem of estimating $h(\theta)$ in terms of X, $\theta$, and
d and that of estimating $g^*h(\theta)$ in terms of X', $\theta'$, and d' are formally the same,
and should therefore be treated the same. In generalization of (1.6) and (1.16), this
means that we should use

(2.16) $\delta(X') = \delta(gX)$ to estimate $g^*[h(\theta)] = h(\bar g\theta)$.

It seems desirable that these two principles should lead to the same estimator
and hence that

(2.17) $\delta(gX) = g^*\delta(X)$.

Definition 2.5 In an invariant estimation problem, an estimator $\delta(X)$ is said to be
equivariant if it satisfies (2.17) for all $g \in G$.

As was discussed in Section 1, the arguments for (2.15) and (2.16) are of a
very different nature. The appropriateness of (2.16) results from the symmetries
exhibited by the situation and represented mathematically by the invariance of the
problem under the transformations $g \in G$. It gives expression to the idea that if
some symmetries are present in an estimation problem, the estimators should possess
the corresponding symmetries. It follows that (2.16) is no longer appropriate
if the symmetry is invalidated by asymmetric prior information; if, for example, $\theta$
is known to be restricted to a subset $\omega$ of the parameter space $\Omega$, for which $\bar g\omega \ne \omega$,
as was the case mentioned at the end of Example 1.1.1 and after Definition 1.3.
In contrast, the argument leading to (2.15) is quite independent of any symmetry
assumptions and simply reflects the fact that if $\delta(X)$ is a reasonable estimator of,
say, $\theta$ then $\phi[\delta(X)]$ is a reasonable estimator of $\phi(\theta)$.
Example 2.6 Continuation of Example 2.3. In Example 2.3, $h(\xi, \eta) = \eta - \xi$,
and by (2.8), $g^*d = d + (b - a)$. It follows that (2.17) becomes

(2.18) $\delta(x + a, y + b) = \delta(x, y) + b - a$.

If $\delta_0(X)$ and $\delta_0'(Y)$ are location equivariant estimators of $\xi$ and $\eta$, respectively, then
$\delta(X, Y) = \delta_0'(Y) - \delta_0(X)$ is an equivariant estimator of $\eta - \xi$. ||
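
Condition (2.18) can be verified numerically for a difference of location equivariant estimators. The sketch below takes the sample mean of the y's and the sample median of the x's as the two component estimators; this choice, the function names, and the data are assumptions made purely for illustration.

```python
import statistics

def delta(x, y):
    """Difference of two location equivariant estimators: mean(y) - median(x)."""
    return statistics.mean(y) - statistics.median(x)

def satisfies_2_18(x, y, a, b, tol=1e-9):
    """Check delta(x + a, y + b) = delta(x, y) + b - a for one pair (a, b)."""
    lhs = delta([xi + a for xi in x], [yi + b for yi in y])
    return abs(lhs - (delta(x, y) + b - a)) <= tol

x = [0.2, 1.5, -0.3, 0.8]
y = [2.4, 3.1, 1.9]
print(all(satisfies_2_18(x, y, a, b) for a in (-2.0, 0.5) for b in (1.0, 7.25)))
```

The two samples are shifted by different amounts, which is what distinguishes (2.18) from the one-sample condition (1.18).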

The following theorem generalizes Theorem 1.4 to the present situation. 

Theorem 2.7 If $\delta$ is an equivariant estimator in a problem which is invariant
under a transformation g, then the risk function of $\delta$ satisfies

(2.19) $R(\bar g\theta, \delta) = R(\theta, \delta)$ for all $\theta$.

Proof. By definition

$R(\bar g\theta, \delta) = E_{\bar g\theta} L[\bar g\theta, \delta(X)]$.

It follows from (2.4) that the right side is equal to

$E_\theta L[\bar g\theta, \delta(gX)] = E_\theta L[\bar g\theta, g^*\delta(X)] = R(\theta, \delta)$. □

Looking back on Section 1, we see that the crucial fact underlying the success of 
the invariance approach was the constancy of the risk function of any equivariant 
estimator. Theorem 2.7 suggests the following simple condition for this property 
to obtain. 

A group G of transformations of a space is said to be transitive if for any two 
points there is a transformation in G taking the first point into the second. 

Corollary 2.8 Under the assumptions of Theorem 2.7, if $\bar G$ is transitive over the
parameter space $\Omega$, then the risk function of any equivariant estimator is constant,
that is, independent of $\theta$.

When the risk function of every equivariant estimator is constant, the best equivariant
estimator (MRE) is obtained by minimizing that constant, so that a uniformly
minimum risk equivariant estimator will then typically exist. In such problems,
alternative characterizations of the best equivariant estimator can be obtained.
(See Problems 2.11 and 2.12.) Berk (1967a) and Kariya (1989) provide a rigorous 
treatment, taking account of the associated measurability problems. A Bayesian 
approach to the derivation of best equivariant estimators is treated in Section 4.4. 

Example 2.9 Conclusion of Example 2.3. In this example, $\theta = (\xi, \eta)$ and
$\bar g\theta = (\xi + a, \eta + b)$. This group of transformations is transitive over $\Omega$ since, given any
two points $(\xi, \eta)$ and $(\xi', \eta')$, a and b exist such that $\xi + a = \xi'$ and $\eta + b = \eta'$. The
MRE estimator can now be obtained in exact analogy to Section 3.1 (Problems
1.13 and 1.14). ||

The estimation problem treated in Section 1 was greatly simplified by the fact
that it was possible to dispense with randomized estimators. The corresponding
result holds quite generally when $\bar G$ is transitive. If an estimator $\delta$ exists which is
MRE among all nonrandomized estimators, it is then also MRE when randomization
is permitted. To see this, note that a randomized estimator can be represented
as $\delta'(X, W)$ where W is independent of X and has a known distribution and that
it is equivariant if $\delta'(gX, W) = g^*\delta'(X, W)$. Its risk is again constant, and for any
$\theta = \theta_0$, it is equal to $E[h(W)]$ where

$h(w) = E_{\theta_0}\{L[\theta_0, \delta'(X, w)]\}$.

This risk is minimized by minimizing $h(w)$ for each w. However, by assumption,
$\delta'(X, w) = \delta(X)$ minimizes $h(w)$, and hence the MRE estimator can be chosen to be
nonrandomized. The corresponding result need not hold when $\bar G$ is not transitive.
A counterexample is given in Example 5.1.8.

Definition 2.10 For a group $\bar G$ of transformations of $\Omega$, two points $\theta_1, \theta_2 \in \Omega$
are equivalent if there exists a $\bar g \in \bar G$ such that $\bar g\theta_1 = \theta_2$. The totality of points
equivalent to a given point (and hence to each other) is called an orbit of $\bar G$. The
group $\bar G$ is transitive over $\Omega$ if it has only one orbit.

For the most part, we will consider transitive groups; however, there are some 
groups of interest that are not transitive. 

Example 2.11 Binomial transformation group. Let $X \sim \mathrm{binomial}(n, p)$,
$0 < p < 1$, and consider the group of transformations

$gX = n - X$,
$\bar g p = 1 - p$.

The orbits are the pairs $(p, 1 - p)$. The group is not transitive. ||
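
Because the group is not transitive, an equivariant estimator need not have constant risk; Theorem 2.7 only guarantees equal risk at p and 1 − p. The sketch below computes the exact risk of $\delta(x) = x/n$, which satisfies $\delta(n - x) = 1 - \delta(x)$, under squared error loss. The choice of loss, of estimator, and of $n = 10$ are assumptions for illustration; the example in the text does not fix them.

```python
from math import comb

def binomial_risk(delta, n, p):
    """Exact squared-error risk E_p[(delta(X) - p)^2] for X ~ binomial(n, p)."""
    return sum(
        comb(n, x) * p ** x * (1 - p) ** (n - x) * (delta(x) - p) ** 2
        for x in range(n + 1)
    )

n = 10

def proportion(x):
    """delta(x) = x/n, an equivariant estimator of p for this group."""
    return x / n

for p in (0.2, 0.8, 0.5):
    print(p, binomial_risk(proportion, n, p))
```

The risks at 0.2 and 0.8 coincide, while the risk at 0.5 is different, so the risk is constant on each orbit $(p, 1 - p)$ but not over the whole parameter space.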

Example 2.12 Orbits of a scale group. Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$, both
unknown, and consider estimation of $\sigma^2$. The model remains invariant under the
scale group

$gX_i = aX_i,$

$\bar g(\mu, \sigma^2) = (a\mu, a^2\sigma^2), \quad a > 0.$

We shall now show that $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$ lie on the same orbit if and only if
$\mu_1/\sigma_1 = \mu_2/\sigma_2$.

On the one hand, suppose that $\mu_1/\sigma_1 = \mu_2/\sigma_2$. Then, $\mu_2/\mu_1 = \sigma_2/\sigma_1 = a$, say,
and $\mu_2 = a\mu_1$; $\sigma_2^2 = a^2\sigma_1^2$. On the other hand, if $\mu_2 = a\mu_1$ and $\sigma_2^2 = a^2\sigma_1^2$, then
$\mu_2/\mu_1 = a$ and $\sigma_2/\sigma_1 = a$. Thus, the values of $\tau = \mu/\sigma$ can be used to label the
orbits of $\bar G$. ‖
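As a small illustration of Example 2.12, the sketch below (with hypothetical helper names `act` and `orbit_label`) applies the induced transformation $(\mu, \sigma^2) \to (a\mu, a^2\sigma^2)$ and confirms that the ratio $\mu/\sigma$ is unchanged, so that it can serve as an orbit label.

```python
def act(a, mu, sigma2):
    """Induced scale transformation on the parameter (mu, sigma^2), for a > 0."""
    return a * mu, a * a * sigma2

def orbit_label(mu, sigma2):
    """The ratio mu / sigma, which is constant along each orbit."""
    return mu / sigma2 ** 0.5

def connecting_scale(sigma2_from, sigma2_to):
    """The only candidate a > 0 that can map the first variance to the second."""
    return (sigma2_to / sigma2_from) ** 0.5
```

Two parameter points with equal labels are connected by the scale `connecting_scale(sigma2_1, sigma2_2)`; points with different labels are not connected by any $a > 0$.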

The following corollary is a straightforward consequence of Theorem 2.7. 

Corollary 2.13 Under the assumptions of Theorem 2.7, the risk function of any
equivariant estimator is constant on the orbits of $\bar G$.

Proof See Problem 2.15. □ 

In Section 1.4, group families were introduced as families of distributions generated
by subjecting a random variable with a fixed distribution to a group of
transformations. Consider now a family of distributions $\mathcal P = \{P_\theta, \theta \in \Omega\}$ which
remains invariant under a group $G$ for which $\bar G$ is transitive over $\Omega$ and $g_1 \neq g_2$
implies $\bar g_1 \neq \bar g_2$. Let $\theta_0$ be any fixed element of $\Omega$. Then $\mathcal P$ is exactly the group
family of distributions of $\{gX, g \in G\}$ when $X$ has distribution $P_{\theta_0}$.

Conversely, let $\mathcal P$ be the group family of the distributions of $gX$ as $g$ varies
over $G$, when $X$ has a fixed distribution $P$, so that $\mathcal P = \{P_g, g \in G\}$. Then, $g$ can
serve as the parameter $\theta$ and $G$ as the parameter space. In this notation, the starting
distribution $P$ becomes $P_e$, where $e$ is the identity transformation. Thus, a family
of distributions remains invariant under a transitive group of transformations of 
the sample space if and only if it is a group family. 

When an estimation problem is invariant under a group of transformations and 
an MRE estimator exists, this seems the natural estimator to use—of the various 
principles we shall consider, equivariance, where it applies, is perhaps the most 
convincing. Yet, even this principle can run into difficulties. The following example 
illustrates the possibility of a problem remaining invariant under two different 
groups, $G_1$ and $G_2$, which lead to two different MRE estimators $\delta_1$ and $\delta_2$.


Example 2.14 Counterexample. Let the pairs $(X_1, X_2)$ and $(Y_1, Y_2)$ be independent,
each with a bivariate normal distribution with mean zero. Let their covariance
matrices be $\Sigma = [\sigma_{ij}]$ and $\Delta\Sigma = [\Delta\sigma_{ij}]$, $\Delta > 0$, and consider the problem of
estimating $\Delta$.

Let $G_1$ be the group of transformations

(2.20) $X_1' = a_1X_1 + a_2X_2, \quad Y_1' = c(a_1Y_1 + a_2Y_2),$
       $X_2' = bX_2, \quad Y_2' = cbY_2.$

Then, $(X_1', X_2')$ and $(Y_1', Y_2')$ will again be independent and each will have a bivariate
normal distribution with zero mean. If the covariance matrix of $(X_1', X_2')$ is $\Sigma'$, that
of $(Y_1', Y_2')$ is $\Delta'\Sigma'$ where $\Delta' = c^2\Delta$ (Problem 2.16). Thus, $G_1$ leaves the model
invariant.

If $h(\Sigma, \Delta) = \Delta$, (2.11) clearly holds, (2.12) and (2.13) become

(2.21) $\Delta' = c^2\Delta, \quad d' = c^2 d,$

respectively, and a loss function $L(\Delta, d)$ satisfies (2.14) provided
$L(c^2\Delta, c^2 d) = L(\Delta, d)$. This condition holds if and only if $L$ is of the form

(2.22) $L(\Delta, d) = \rho(d/\Delta).$

[For the necessity of (2.22), see Problem 2.10.]

An estimator $\delta$ of $\Delta$ is equivariant under the above transformation if

(2.23) $\delta(x', y') = c^2\delta(x, y).$

We shall now show that (2.23) holds if and only if

(2.24) $\delta(x, y) = \frac{k y_2^2}{x_2^2}$ for some value of $k$ a.e.

It is enough to prove this for the reduced sample space in which the matrix

$\begin{pmatrix} x_1 & x_2 \\ y_1 & y_2 \end{pmatrix}$

is nonsingular and in which both $x_2$ and $y_2$ are $\neq 0$, since the rest of the sample
space has probability zero.

Let $G_1'$ be the subgroup of $G_1$ consisting of the transformations (2.20) with
$b = c = 1$. The condition of equivariance under these transformations reduces to

(2.25) $\delta(x', y') = \delta(x, y).$

This is satisfied whenever $\delta$ depends only on $x_2$ and $y_2$ since $x_2' = x_2$ and $y_2' = y_2$.
To see that this condition is also necessary for (2.25), suppose that $\delta$ satisfies (2.25)
and let $(x_1', x_2; y_1', y_2)$ and $(x_1, x_2; y_1, y_2)$ be any two points in the reduced sample
space which have the same second coordinates. Then, there exist $a_1$ and $a_2$ such
that

$x_1' = a_1x_1 + a_2x_2; \quad y_1' = a_1y_1 + a_2y_2,$

that is, there exists $g \in G_1'$ for which $g(x, y) = (x', y')$, and hence $\delta$ depends only
on $x_2, y_2$.

Consider now any $\delta'(x_2, y_2)$. To be equivariant under the full group $G_1$, $\delta'$ must
satisfy

(2.26) $\delta'(bx_2, cby_2) = c^2\delta'(x_2, y_2).$

For $x_2 = y_2 = 1$, this condition becomes

$\delta'(b, cb) = c^2\delta'(1, 1)$

and hence reduces to (2.24) with $x_2 = b$, $y_2 = bc$, and $k = \delta'(1, 1)$. This shows that
(2.24) is necessary for $\delta$ to be equivariant; that it is sufficient is obvious.

The best equivariant estimator under $G_1$ is thus $k^*Y_2^2/X_2^2$ where $k^*$ is a value
which minimizes

$E_\Delta\,\rho\left(\frac{kY_2^2}{\Delta X_2^2}\right) = E_1\,\rho\left(\frac{kY_2^2}{X_2^2}\right).$

Such a minimizing value will typically exist. Suppose, for example, that the loss
is 1 if $|d - \Delta|/\Delta > 1/2$ and zero otherwise. Then, $k^*$ is obtained by maximizing

$P_1\left(\left|\frac{kY_2^2}{X_2^2} - 1\right| < \frac{1}{2}\right).$

As $k \to 0$ or $\infty$, this probability tends to zero, and a maximizing value therefore
exists and can be determined from the distribution of $Y_2^2/X_2^2$ when $\Delta = 1$.

Exactly the same argument applies if $G_1$ is replaced by the transformations $G_2$

$X_1' = bX_1, \quad Y_1' = cbY_1,$

$X_2' = a_1X_1 + a_2X_2, \quad Y_2' = c(a_1Y_1 + a_2Y_2),$

and leads to the MRE estimator $k^*Y_1^2/X_1^2$. See Problems 2.19 and 2.20. ‖
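For the zero-one loss mentioned in Example 2.14, the maximizing constant can be found explicitly. The sketch below rests on one observation that is not spelled out in the text: when $\Delta = 1$, the second coordinates of the two pairs are independent normal variables with mean zero and the same variance, so the absolute value of their ratio is a standard half-Cauchy variable and the squared ratio has distribution function $(2/\pi)\arctan\sqrt{t}$. The function names `coverage` and `best_k` are illustrative.

```python
from math import atan, sqrt, pi

def coverage(k):
    """P(|k*Y2^2/X2^2 - 1| < 1/2) when Delta = 1."""
    # CDF of the squared ratio of two iid mean-zero normals (an F(1, 1) variable).
    cdf = lambda t: (2 / pi) * atan(sqrt(t))
    # |k*F - 1| < 1/2 is the event 1/(2k) < F < 3/(2k).
    return cdf(1.5 / k) - cdf(0.5 / k)

def best_k(lo=0.01, hi=10.0, steps=100_000):
    """Grid search for the k maximizing the coverage probability."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=coverage)
```

Setting the derivative of the coverage probability to zero gives $k^* = \sqrt{3}/2$, which the grid search reproduces to within the grid spacing.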

In the location case, it turned out (Theorem 1.27) that an MRE estimator is 
always risk-unbiased. The extension of this result to the general case requires 
some assumptions. 

Theorem 2.15 If $\bar G$ is transitive and $G^*$ commutative, then an MRE estimator is
risk-unbiased.

Proof. Let $\delta$ be MRE and $\theta, \theta' \in \Omega$. Then, by the transitivity of $\bar G$, there exists
$\bar g \in \bar G$ such that $\theta = \bar g\theta'$, and hence

$E_\theta L[\theta', \delta(X)] = E_\theta L[\bar g^{-1}\theta, \delta(X)] = E_\theta L[\theta, g^*\delta(X)].$

Now, if $\delta(X)$ is equivariant, so is $g^*\delta(X)$ (Problem 2.18), and, therefore, since $\delta$ is
MRE,

$E_\theta L[\theta, g^*\delta(X)] \ge E_\theta L[\theta, \delta(X)],$

which completes the proof. □

Transitivity of $\bar G$ will usually [but not always, see Example 2.16(a) below] hold
when an MRE estimator exists. On the other hand, commutativity of $G^*$ imposes
a severe restriction. That the theorem need not be valid if either condition fails is
shown by the following example.

Example 2.16 Counterexample. Let $X$ be $N(\xi, \sigma^2)$ with both parameters unknown,
let the estimand be $\xi$ and the loss function be

(2.27) $L(\xi, \sigma; d) = (d - \xi)^2/\sigma^2.$

(a) The problem remains invariant under the group $G_1$: $gx = x + c$. It follows
from Section 1 that $X$ is MRE under $G_1$. However, $X$ is not risk-unbiased
(Problem 2.19). Here, $\bar G_1$ is the group of transformations

$\bar g(\xi, \sigma) = (\xi + c, \sigma),$

which is clearly not transitive.

If the loss function is replaced by $(d - \xi)^2$, the problem will remain invariant
under $G_1$; $X$ remains equivariant but is now risk-unbiased by Example 1.25.
Transitivity of $\bar G$ is thus not necessary for the conclusion of Theorem 2.15.

(b) When the loss function is given by (2.27), the problem also remains invariant
under the larger group $G_2$: $ax + c$, $0 < a$. Since $X$ is equivariant under $G_2$ and
MRE under $G_1$, it is also MRE under $G_2$. However, as stated in (a), $X$ is not
risk-unbiased with respect to (2.27). Here, $G^*$ is the group of transformations
$g^*d = ad + c$, and this is not commutative (Problem 2.19). ‖
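One way to see the failure of risk-unbiasedness in Example 2.16 directly is the following short calculation, given here as a sketch rather than as the solution intended in Problem 2.19. It treats $X$ as a single $N(\xi, \sigma^2)$ observation and compares the expected loss (2.27) at the true $(\xi, \sigma)$ with that at an alternative $(\xi', \sigma')$.

```latex
E_{\xi,\sigma}\, L(\xi', \sigma'; X)
  = \frac{E_{\xi,\sigma}(X - \xi')^2}{\sigma'^2}
  = \frac{\sigma^2 + (\xi - \xi')^2}{\sigma'^2}.
% At (\xi', \sigma') = (\xi, \sigma) this equals 1, whereas \xi' = \xi and \sigma' > \sigma give
\frac{\sigma^2}{\sigma'^2} < 1 = E_{\xi,\sigma}\, L(\xi, \sigma; X),
% so the inequality required for risk-unbiasedness is violated.
```

With the loss $(d - \xi)^2$ the divisor $\sigma'^2$ disappears and the expected loss $\sigma^2 + (\xi - \xi')^2$ is minimized at $\xi' = \xi$, in agreement with part (a).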

The location problem considered in Section 1 provides an important example 
in which the assumptions of Theorem 2.15 are satisfied, and Theorem 1.27 is the 
specialization of Theorem 2.15 to that case. The scale problem, which will be 
considered in Section 3, can also provide another illustration. 

We shall not attempt to generalize to the present setting the characterization of 
equivariant estimators which was obtained for the location case in Theorem 1.8. 
Some results in this direction, taking account also of the associated measurability 
problems, can be found in Eaton (1989) or Wijsman (1990). Instead, we shall 
consider in the next section some other extensions of the problem treated in Section 
1. 

We close this section by exhibiting a family $\mathcal P$ of distributions for which there
exists no group leaving $\mathcal P$ invariant (except the trivial group consisting of the
identity only).

Theorem 2.17 Let $X$ be distributed according to the power series distribution
[see (2.3.9)]

(2.28) $P(X = k) = c_k\theta^k h(\theta); \quad k = 0, 1, \ldots, \quad 0 < \theta < \infty.$

If $c_k > 0$ for all $k$, then there does not exist a transformation $gx = g(x)$ leaving
the family (2.28) invariant except the identity transformation $g(x) = x$ for all $x$.

Proof. Suppose $Y = g(X)$ is a transformation leaving (2.28) invariant, and let
$g(k) = a_k$ and $\bar g\theta = \mu$. Then, $P_\theta(X = k) = P_\mu(Y = a_k)$ and hence

(2.29) $c_k\theta^k h(\theta) = c_{a_k}\mu^{a_k} h(\mu).$

Replacing $k$ by $k + 1$ and dividing the resulting equation by (2.29), we see that

(2.30) $\frac{c_{k+1}}{c_k}\,\theta = \frac{c_{a_{k+1}}}{c_{a_k}}\,\mu^{a_{k+1} - a_k}.$

Replacing $k$ by $k + 1$ in (2.30) and dividing the resulting equation by (2.30) shows
that

$\mu^{a_{k+2} - a_{k+1}}$ is proportional to $\mu^{a_{k+1} - a_k}$ for all $0 < \mu < \infty$

and hence that

$a_{k+2} - a_{k+1} = a_{k+1} - a_k.$

If we denote this common value by $\Delta$, we get

(2.31) $a_k = a_0 + k\Delta$ for $k = 0, 1, 2, \ldots.$

Invariance of the model requires the set (2.31) to be a permutation of the set
$\{0, 1, 2, \ldots\}$. This implies that $\Delta > 0$ and hence that $a_0 = 0$ and $\Delta = 1$, i.e., that
$a_k = k$ and $g$ is the identity. □

Example 2.11 shows that this result no longer holds if $c_k = 0$ for $k$ exceeding
some $k_0$; see Problem 2.28.


3 Location-Scale Families 


The location model discussed in Section 1 provides a good introduction to the ideas
of equivariance, but it is rarely realistic. Even when it is reasonable to assume the
form of the density $f$ in (1.9) to be known, it is usually desirable to allow the model
to contain an unknown scale parameter. The standard normal model according to
which $X_1, \ldots, X_n$ are iid as $N(\xi, \sigma^2)$ is the most common example of such a
location-scale model. In this section, we apply some of the general principles
developed in Section 2 to location-scale models, as well as some other group
models. As preparation for the analysis of these models, we begin with the case,
which is of interest also in its own right, in which the only unknown parameter is
a scale parameter.

Let $X = (X_1, \ldots, X_n)$ have a joint probability density

(3.1) $\frac{1}{\tau^n} f\left(\frac{x}{\tau}\right) = \frac{1}{\tau^n} f\left(\frac{x_1}{\tau}, \ldots, \frac{x_n}{\tau}\right), \quad \tau > 0,$

where $f$ is known and $\tau$ is an unknown scale parameter. This model remains
invariant under the transformations

(3.2) $X_i' = bX_i, \quad \tau' = b\tau$ for $b > 0$.

The estimand of primary interest is $h(\tau) = \tau^r$. Since $h$ is strictly monotone,
(2.11) is vacuously satisfied. Transformations (3.2) induce the transformations

(3.3) $h(\tau) \to b^r\tau^r = b^r h(\tau)$ and $d' = b^r d$,

and the loss function $L$ is invariant under these transformations, provided

(3.4) $L(b\tau, b^r d) = L(\tau, d).$

This is the case if and only if it is of the form (Problem 3.1)

(3.5) $L(\tau, d) = \gamma\left(\frac{d}{\tau^r}\right).$

Examples are

(3.6) $L(\tau, d) = \frac{(d - \tau^r)^2}{\tau^{2r}}$ and $L(\tau, d) = \frac{|d - \tau^r|}{\tau^r}$

but not squared error.
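The invariance condition (3.4) is easy to test numerically for a candidate loss. The following sketch uses illustrative function names and arbitrary test values of $\tau$, $d$, and $b$; it confirms that the two losses in (3.6) satisfy (3.4), whereas squared error does not.

```python
def squared_relative(tau, d, r):
    """First loss in (3.6): (d - tau^r)^2 / tau^(2r)."""
    return (d - tau ** r) ** 2 / tau ** (2 * r)

def absolute_relative(tau, d, r):
    """Second loss in (3.6): |d - tau^r| / tau^r."""
    return abs(d - tau ** r) / tau ** r

def squared_error(tau, d, r):
    """Ordinary squared error, which is not of the form (3.5)."""
    return (d - tau ** r) ** 2

def is_invariant(loss, r, tau=1.7, d=2.3, scales=(0.5, 2.0, 3.0), tol=1e-9):
    """Check L(b*tau, b^r * d) = L(tau, d) at one point for a few scales b."""
    base = loss(tau, d, r)
    return all(abs(loss(b * tau, b ** r * d, r) - base) < tol for b in scales)
```

A check at a single point can only refute invariance, not prove it; the general statement is the content of Problem 3.1.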

An estimator $\delta$ of $\tau^r$ is equivariant under (3.2), or scale equivariant, provided

(3.7) $\delta(bX) = b^r\delta(X).$


All the usual estimators of $\tau$ are scale equivariant; for example, the standard deviation
$\sqrt{\sum(X_i - \bar X)^2/(n - 1)}$, the mean deviation $\sum|X_i - \bar X|/n$, the range, and
the maximum likelihood estimator [Problem 3.1(b)].
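Condition (3.7) can be checked directly for the estimators just listed. The sketch below uses illustrative helper names and a few arbitrary scale factors; it tests $\delta(bx) = b^r\delta(x)$ with $r = 1$ for the standard deviation, the mean deviation, and the range.

```python
from math import sqrt

def std_dev(x):
    """Sample standard deviation with divisor n - 1."""
    m = sum(x) / len(x)
    return sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))

def mean_dev(x):
    """Mean absolute deviation about the sample mean."""
    m = sum(x) / len(x)
    return sum(abs(xi - m) for xi in x) / len(x)

def sample_range(x):
    """Largest minus smallest observation."""
    return max(x) - min(x)

def is_scale_equivariant(est, x, r=1, scales=(0.5, 2.0, 7.0), tol=1e-9):
    """Check est(b*x) = b^r * est(x) for a few scale factors b > 0."""
    base = est(x)
    return all(abs(est([b * xi for xi in x]) - b ** r * base) < tol for b in scales)
```

The same check with `r=2` applies to estimators of $\tau^2$, such as the squared standard deviation.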

Since the group $\bar G$ of transformations $\tau' = b\tau$, $b > 0$, is transitive over $\Omega$,
the risk of any equivariant estimator is constant by Corollary 2.8, so that one can
expect an MRE estimator to exist. To derive it, we first characterize the totality of
equivariant estimators.

Theorem 3.1 Let $X$ have density (3.1) and let $\delta_0(X)$ be any scale equivariant
estimator of $\tau^r$. Then, if

(3.8) $z_i = \frac{x_i}{x_n}$ $(i = 1, \ldots, n - 1)$ and $z_n = \frac{x_n}{|x_n|}$

and if $z = (z_1, \ldots, z_n)$, a necessary and sufficient condition for $\delta$ to satisfy (3.7) is
that there exists a function $w(z)$ such that

$\delta(x) = \frac{\delta_0(x)}{w(z)}.$

Proof. Analogous to Lemma 1.6, a necessary and sufficient condition for $\delta$ to
satisfy (3.7) is that it is of the form $\delta(x) = \delta_0(x)/u(x)$ where (Problem 3.4)

(3.9) $u(bx) = u(x)$ for all $x$ and all $b > 0$.

It remains to show that (3.9) holds if and only if $u$ depends on $x$ only through $z$.
Note here that $z$ is defined when $x_n \neq 0$ and, hence, with probability 1. That any
function of $z$ satisfies (3.9) is obvious. Conversely, if (3.9) holds, then, taking $b = 1/|x_n|$,

$u(x_1, \ldots, x_n) = u\left(\frac{x_1}{|x_n|}, \ldots, \frac{x_n}{|x_n|}\right) = u(z_1 z_n, \ldots, z_{n-1} z_n, z_n);$

hence, $u$ does depend only on $z$, as was to be proved. □

Example 3.2 Scale equivariant estimator based on one observation. Suppose
that $n = 1$. Then, the most general estimator satisfying (3.7) is of the form $X^r/w(Z)$
where $Z = X/|X|$ is $\pm 1$ as $X$ is $>$ or $< 0$, so that

$\delta(X) = AX^r$ if $X > 0$,
$\delta(X) = BX^r$ if $X < 0$,

$A$, $B$ being two arbitrary constants. ‖


Let us now determine the MRE estimator for a general scale family. 

Theorem 3.3 Let $X$ be distributed according to (3.1) and let $Z$ be given by (3.8).
Suppose that the loss function is given by (3.5) and that there exists an equivariant
estimator $\delta_0$ of $\tau^r$ with finite risk. Assume that for each $z$, there exists a number
$w(z) = w^*(z)$ which minimizes

(3.10) $E_1\{\gamma[\delta_0(X)/w(z)] \mid z\}.$

Then, an MRE estimator $\delta^*$ of $\tau^r$ exists and is given by

(3.11) $\delta^*(X) = \frac{\delta_0(X)}{w^*(Z)}.$

The proof parallels that of Theorem 1.10.

Corollary 3.4 Under the assumptions of Theorem 3.3, suppose that $\rho(v) = \gamma(e^v)$
is convex and not monotone. Then, an MRE estimator of $\tau^r$ exists; it is unique if $\rho$
is strictly convex.

Proof. By replacing $\gamma(w)$ by $\rho(\log w)$ [with $\rho(-\infty) = \gamma(0)$], the result essentially
reduces to that of Corollary 1.11. This argument requires that $\delta > 0$, which can be
assumed without loss of generality (Problem 3.2). □


Example 3.5 Standardized power loss. Consider the loss function

(3.12) $L(\tau, d) = \frac{|d - \tau^r|^p}{\tau^{pr}} = \left|\frac{d}{\tau^r} - 1\right|^p$

with $\gamma(v) = |v - 1|^p$. Then, $\rho$ is strictly convex for $v > 0$, provided $p > 1$
(Problem 3.5). Under the assumptions of Theorem 3.3, if we set

(3.13) $\gamma\left(\frac{d}{\tau^r}\right) = \frac{(d - \tau^r)^2}{\tau^{2r}},$

then (Problem 3.10)

(3.14) $\delta^*(X) = \frac{\delta_0(X)\,E_1[\delta_0(X) \mid Z]}{E_1[\delta_0^2(X) \mid Z]};$

if

(3.15) $\gamma\left(\frac{d}{\tau^r}\right) = \frac{|d - \tau^r|}{\tau^r},$

then $\delta^*(X)$ is given by (3.11), with $w^*(Z)$ any scale median of $\delta_0(X)$ under the
conditional distribution of $X$ given $Z$ and with $\tau = 1$, that is, $w^*(z)$ satisfies

(3.16) $E(X \mid Z)I(X > w^*(Z)) = E(X \mid Z)I(X < w^*(Z))$

(Problems 3.7 and 3.10).





Example 3.6 Continuation of Example 3.2. Suppose that $n = 1$, and $X > 0$
with probability 1. Then, the arguments of Theorem 3.3 and Example 3.5 show
that if $X^r$ has finite risk, the MRE estimator of $\tau^r$ is $X^r/w^*$ where $w^*$ is any value
minimizing

(3.17) $E_1[\gamma(X^r/w)].$

In particular, the MRE estimator is

(3.18) $X^r E_1(X^r)/E_1(X^{2r})$

when the loss is (3.13), and it is $X^r/w^*$, where $w^*$ is any scale median of $X^r$ for
$\tau = 1$, when the loss is (3.15). ‖


Example 3.7 MRE for normal variance, known mean. Let $X_1, \ldots, X_n$ be iid
according to $N(0, \sigma^2)$ and consider the estimation of $\sigma^2$. For $\delta_0 = \sum X_i^2$, it follows
from Basu's theorem that $\delta_0$ is independent of $Z$ and hence that $w^*(z) = w^*$ is
a constant determined by minimizing (3.17) with $\sum X_i^2$ in place of $X^r$. For the
loss function (3.13) with $r = 2$, the MRE estimator turns out to be $\sum X_i^2/(n + 2)$
[Equation (2.2.26) or Problem 3.7]. ‖
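The divisor $n + 2$ in Example 3.7 can be recovered from the first two moments of a chi-squared variable. The sketch below uses illustrative function names and exact rational arithmetic. At $\sigma = 1$ the risk of $c\sum X_i^2$ under (3.13) with $r = 2$ is $E(cW - 1)^2$ with $W \sim \chi^2_n$, a quadratic in $c$.

```python
from fractions import Fraction

def risk(c, n):
    """E[(c*W - 1)^2] for W ~ chi-square(n): risk of c * sum(X_i^2) at sigma = 1."""
    # E[W] = n and E[W^2] = n^2 + 2n for a chi-square variable with n degrees of freedom.
    return c * c * (n * n + 2 * n) - 2 * c * n + 1

def best_multiplier(n):
    """Minimizer in c of the quadratic risk(c, n), namely E[W] / E[W^2]."""
    return Fraction(n, n * n + 2 * n)
```

The minimizing multiplier is $1/(n + 2)$ with minimum risk $2/(n + 2)$, compared with risk $2/n$ for the unbiased multiplier $1/n$.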


Quite generally, when the loss function is (3.13), the MRE estimator of $\tau^r$ is
given by

(3.19) $\delta^*(x) = \frac{\int_0^\infty v^{n+r-1} f(vx_1, \ldots, vx_n)\,dv}{\int_0^\infty v^{n+2r-1} f(vx_1, \ldots, vx_n)\,dv}$

and in this form, it is known as the Pitman estimator of $\tau^r$. The proof parallels that
of Theorem 1.20 (Problem 3.16).
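Formula (3.19) lends itself to direct numerical evaluation. The following is a minimal sketch, not a production integrator: it approximates both integrals by a Riemann sum on a truncated range, and the function names, the truncation point `upper`, and the number of steps are assumptions chosen for the illustration. For the normal density of Example 3.7 with $r = 2$ it should reproduce $\sum x_i^2/(n + 2)$.

```python
from math import exp

def pitman_scale(x, r, log_f, upper=20.0, steps=20_000):
    """Approximate (3.19) by a Riemann sum on (0, upper]; the step width cancels in the ratio."""
    n = len(x)
    h = upper / steps
    num = den = 0.0
    for i in range(1, steps + 1):
        v = i * h
        fv = exp(log_f([v * xi for xi in x]))   # f(v*x_1, ..., v*x_n), up to a constant
        num += v ** (n + r - 1) * fv
        den += v ** (n + 2 * r - 1) * fv
    return num / den

def log_normal_kernel(y):
    """log f for iid N(0, 1) coordinates, up to an additive constant that cancels in (3.19)."""
    return -0.5 * sum(yi * yi for yi in y)
```

The truncation point must be large enough that the integrand is negligible beyond it; for data with a very small sum of squares a larger `upper` would be needed.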

The loss function (3.13) satisfies

$\lim_{d \to \infty} L(\tau, d) = \infty$ but $\lim_{d \to 0} L(\tau, d) = 1,$

so that it assigns much heavier penalties to overestimation than to underestimation.
An alternative to the loss function (3.13) and (3.15), first introduced by Stein (James
and Stein, 1961), and known as Stein's loss, is given by

(3.20) $L_s(\tau, d) = (d/\tau^r) - \log(d/\tau^r) - 1.$

For this loss, $\lim_{d \to \infty} L_s(\tau, d) = \lim_{d \to 0} L_s(\tau, d) = \infty$; it is thus somewhat more
evenhanded. For another justification of (3.20), see Brown 1968, 1990b and also
Dey and Srinivasan 1985.

The change in the estimator (3.14) if (3.13) is replaced by (3.20) is shown in the 
following corollary. 

Corollary 3.8 Under the assumptions of Theorem 3.3, if the loss function is given
by (3.20), the MRE estimator $\delta^*_s$ of $\tau^r$ is uniquely given by

(3.21) $\delta^*_s = \delta_0(X)/E_1(\delta_0(X) \mid z).$

Proof. Problem 3.19. □

In light of the above discussion about skewness of the loss function, it is interesting
to compare $\delta^*_s$ of (3.21) with $\delta^*$ of (3.14). It is clear that $\delta^*_s \ge \delta^*$ if and only
if $E_1(\delta_0^2(X) \mid Z) \ge [E_1(\delta_0(X) \mid Z)]^2$, which will always be the case. Thus, $L_s$ results
in an estimator which is larger.

Example 3.9 Normal scale estimation under Stein's loss. For the situation of
Example 3.7, with $r = 2$, the MLE is $\delta^*_s(x) = \sum X_i^2/n$ which is always larger than
$\delta^* = \sum X_i^2/(n + 2)$, the MRE estimator under $L_2(\tau, d)$. Brown (1968) explores the
loss function $L_s$ further, and shows that it is the only scale invariant loss function
for which the UMVU estimator is also the MRE estimator. ‖
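The form of (3.21) can be checked by a direct minimization. The following is a sketch under the assumptions that $\delta_0 > 0$ and that the conditional expectations involved are finite; it is not necessarily the route intended in Problem 3.19.

```latex
% Conditional risk of \delta_0 / w under Stein's loss (3.20) at \tau = 1:
R(w) = E_1\!\left[\frac{\delta_0(X)}{w} - \log\frac{\delta_0(X)}{w} - 1 \,\Big|\, z\right]
     = \frac{E_1[\delta_0(X) \mid z]}{w} + \log w - E_1[\log \delta_0(X) \mid z] - 1 .
% Setting the derivative to zero:
R'(w) = -\frac{E_1[\delta_0(X) \mid z]}{w^2} + \frac{1}{w} = 0
\quad\Longrightarrow\quad
w^*(z) = E_1[\delta_0(X) \mid z].
% Normal case of Example 3.9: \delta_0 = \sum X_i^2 has E_1[\delta_0] = n, so \delta_s^* = \sum X_i^2 / n.
```

Since $R'(w) < 0$ for $w < w^*$ and $R'(w) > 0$ for $w > w^*$, the stationary point is the unique minimizer.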

So far, the estimator $\delta$ has been assumed to be nonrandomized. Since $\bar G$ is
transitive over $\Omega$, it follows from the result proved in the preceding section that
randomized estimators need not be considered. It is further seen, as for the corresponding
result in the location case, that if a sufficient statistic $T$ exists which
permits a representation $T = (T_1, \ldots, T_r)$ with

$T_i(bX) = bT_i(X)$ for all $b > 0$,

then an MRE estimator can be found which depends only on $T$. Illustrations are
provided by Example 3.7 and Problem 3.12, with $T = (\sum X_i^2)^{1/2}$ and $T = X_{(n)}$,
respectively. When the loss function is (3.13), it follows from the factorization
criterion that the MRE estimator (3.19) depends only on $T$.

Since the group $\tau' = b\tau$, $b > 0$, is transitive and the group $d' = b^r d$ is commutative,
Theorem 2.15 applies and an MRE estimator is always risk-unbiased, although
the MRE estimators of Examples 3.7 and 3.9 are not unbiased in the sense of
Chapter 2. See also Problem 3.12.

Example 3.10 Risk-unbiasedness. If the loss function is (3.13), the condition of
risk-unbiasedness reduces to

(3.22) $E_\tau[\delta^2(X)] = \tau^r E_\tau[\delta(X)].$

Given any scale equivariant estimator $\delta_0(X)$ of $\tau^r$, there exists a value of $c$ for
which $c\delta_0(X)$ satisfies (3.22), and for this value, $c\delta_0(X)$ has uniformly smaller risk
than $\delta_0(X)$ unless $c = 1$ (Problem 3.21).

If the loss function is (3.15), the condition of risk-unbiasedness requires that
$E_\tau|\delta(X) - a|/a$ be minimized by $a = \tau^r$. From Example 3.5, for this loss function,
risk-unbiasedness is equivalent to the condition that the estimand $\tau^r$ is equal to the
scale median of $\delta(X)$. ‖


Let us now turn to location-scale families, where the density of $X = (X_1, \ldots, X_n)$
is given by

(3.23) $\frac{1}{\tau^n} f\left(\frac{x_1 - \xi}{\tau}, \ldots, \frac{x_n - \xi}{\tau}\right)$

with both parameters unknown. Consider first the estimation of $\tau^r$ with loss function
(3.5). This problem remains invariant under the transformations

(3.24) $X_i' = a + bX_i, \quad \xi' = a + b\xi, \quad \tau' = b\tau \quad (b > 0),$

and $d' = b^r d$, and an estimator $\delta$ of $\tau^r$ is equivariant under this group if

(3.25) $\delta(a + bX) = b^r\delta(X).$

Consider first only a change in location,

(3.26) $X_i' = X_i + a,$

which takes $\xi$ into $\xi' = \xi + a$ but leaves $\tau$ unchanged. By (3.25), $\delta$ must then satisfy

(3.27) $\delta(x + a) = \delta(x),$

that is, remain invariant. By Lemma 1.7, condition (3.27) holds if and only if $\delta$ is
a function only of the differences $y_i = x_i - x_n$. The joint density of the $Y$'s is

(3.28) $\frac{1}{\tau^n}\int_{-\infty}^{\infty} f\left(\frac{y_1 + t}{\tau}, \ldots, \frac{y_{n-1} + t}{\tau}, \frac{t}{\tau}\right) dt = \frac{1}{\tau^{n-1}}\int_{-\infty}^{\infty} f\left(\frac{y_1}{\tau} + u, \ldots, \frac{y_{n-1}}{\tau} + u, u\right) du.$

Since this density has the structure (3.1) of a scale family, Theorem 3.3 applies
and provides the estimator that uniformly minimizes the risk among all estimators
satisfying (3.25).

It follows from Theorem 3.3 that such an MRE estimator of $\tau^r$ is given by

(3.29) $\delta(X) = \frac{\delta_0(Y)}{w^*(Z)}$

where $\delta_0(Y)$ is any finite risk scale equivariant estimator of $\tau^r$ based on
$Y = (Y_1, \ldots, Y_{n-1})$, where $Z = (Z_1, \ldots, Z_{n-1})$ with

(3.30) $Z_i = \frac{Y_i}{Y_{n-1}}$ $(i = 1, \ldots, n - 2)$ and $Z_{n-1} = \frac{Y_{n-1}}{|Y_{n-1}|},$

and where $w^*(Z)$ is any number minimizing

(3.31) $E_{\tau=1}\{\gamma[\delta_0(Y)/w(Z)] \mid Z\}.$

Example 3.11 MRE for normal variance, unknown mean. Let $X_1, \ldots, X_n$ be
iid according to $N(\xi, \sigma^2)$ and consider the estimation of $\sigma^2$ with loss function
(3.13), $r = 2$. By Basu's theorem, $(\bar X, \sum(X_i - \bar X)^2)$ is independent of $Z$. If
$\delta_0 = \sum(X_i - \bar X)^2$, then $\delta_0$ is equivariant under (3.24) and independent of $Z$. Hence,
$w^*(z) = w^*$ in (3.29) is a constant determined by minimizing (3.17) with
$\sum(X_i - \bar X)^2$ in place of $X^r$. Since $\sum(X_i - \bar X)^2$ has the distribution of $\delta_0$ of Example 3.7
with $n - 1$ in place of $n$, the MRE estimator for the loss function (3.13) with $r = 2$
is $\sum(X_i - \bar X)^2/(n + 1)$. ‖

Example 3.12 Uniform. Let $X_1, \ldots, X_n$ be iid according to $U(\xi - \tau/2, \xi + \tau/2)$,
and consider the problem of estimating $\tau$ with loss function (3.13), $r = 1$. By
Basu's theorem, $(X_{(1)}, X_{(n)})$ is independent of $Z$. If $\delta_0$ is the range $R = X_{(n)} - X_{(1)}$,
it is equivariant under (3.24) and independent of $Z$. It follows from (3.18) with
$r = 1$ that (Problem 3.22) $\delta^*(X) = [(n + 2)/n]R$. ‖
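The multiplier $(n + 2)/n$ in Example 3.12 follows from (3.18) once the first two moments of the range are known. The sketch below uses the standard fact, not derived in the text, that the range of $n$ iid uniform variables on an interval of length 1 has a Beta$(n - 1, 2)$ distribution; the function names are illustrative.

```python
from fractions import Fraction

def beta_moment(a, b, k):
    """E[B^k] for B ~ Beta(a, b) and integer k >= 0: product of (a + i)/(a + b + i)."""
    m = Fraction(1)
    for i in range(k):
        m *= Fraction(a + i, a + b + i)
    return m

def mre_range_multiplier(n):
    """E_1(R) / E_1(R^2) from (3.18) with r = 1, where R ~ Beta(n - 1, 2) at tau = 1."""
    return beta_moment(n - 1, 2, 1) / beta_moment(n - 1, 2, 2)
```

For every $n \ge 2$ the ratio simplifies to $(n + 2)/n$, so the MRE estimator inflates the range, which on average underestimates $\tau$.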


Since the group $\xi' = a + b\xi$, $\tau' = b\tau$ is transitive and the group $d' = b^r d$ is
commutative, it follows (as in the pure scale case) that an MRE estimator is always
risk-unbiased.

The principle of equivariance seems to suggest that we should want to invoke
as much invariance as possible and hence use the largest group $G$ of transformations
leaving the problem invariant. Such a group may have the disadvantage of
restricting the class of eligible estimators too much. (See, for example, Problem
2.7.) To increase the number of available estimators, we may then want to restrict
attention to a subgroup $G_0$ of $G$. Since estimators that are equivariant under $G$ are
automatically also equivariant under $G_0$, invariance under $G_0$ alone will leave us
with a larger choice, which may enable us to obtain improved risk performance.

For estimating the scale parameter in a location-scale family, a natural subgroup 
of (3.24) is obtained by setting a = 0, which reduces (3.24) to the scale group 


(3.32) $X_i' = bX_i, \quad \xi' = b\xi, \quad \tau' = b\tau \quad (b > 0),$

and $d' = b^r d$. An estimator $\delta$ of $\tau^r$ is equivariant under this group if
$\delta(bX) = b^r\delta(X)$, as in (3.7). Application of Theorem 3.1 shows that the equivariant
estimators are of the form

(3.33) $\delta(x) = \frac{\delta_0(x)}{w(z)}$

where $\delta_0$ is any scale equivariant estimator and $w(z)$ is a function of $z_i = x_i/x_n$,
$i = 1, \ldots, n - 1$, and $z_n = x_n/|x_n|$. However, we cannot now apply Theorem 3.3
to obtain the MRE estimator, because the group is no longer transitive (Example
2.12), and the risk of equivariant estimators is no longer constant.

We can, however, go further in special cases, such as in the following example. 


Example 3.13 More normal variance estimation. If $X_1, \ldots, X_n$ are iid as
$N(\xi, \tau^2)$, with both parameters unknown, then it was shown in Example 3.11
that $\delta_0(x) = \sum(x_i - \bar x)^2/(n + 1) = S^2/(n + 1)$ is MRE under the location-scale
group (3.24) for the loss function (3.13) with $r = 2$.

Now consider the scale group (3.32). Of course, $\delta_0$ is equivariant under this
group, but so are the estimators

$\delta(x) = \varphi(\bar x/s)s^2$

for some function $\varphi(\cdot)$ (Problem 3.24). Stein (1964) showed that
$\varphi(\bar x/s) = \min\{(n + 1)^{-1}, (n + 2)^{-1}(1 + n\bar x^2/s^2)\}$ produces a uniformly better estimator than $\delta_0$, and
Brewster and Zidek (1974) found the best scale equivariant estimator. See Example
5.2.15 and Problem 5.2.14 for more details. ‖
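The improvement described in Example 3.13 can be seen in a small simulation. The following is a sketch only; it reads $s^2$ as $S^2 = \sum(x_i - \bar x)^2$, so that Stein's estimator becomes $\min\{S^2/(n + 1), (S^2 + n\bar x^2)/(n + 2)\}$. The sample size, parameter values, number of replications, and seed are arbitrary choices for the illustration.

```python
import random

def simulate_risks(n=5, xi=0.0, tau=1.0, reps=100_000, seed=0):
    """Monte Carlo risks under loss (3.13), r = 2, of S^2/(n+1) and of Stein's estimator."""
    rng = random.Random(seed)
    total_mre = total_stein = 0.0
    for _ in range(reps):
        x = [rng.gauss(xi, tau) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((v - xbar) ** 2 for v in x)           # S^2 = sum of squared deviations
        d_mre = s2 / (n + 1)                            # MRE under the location-scale group
        d_stein = min(s2 / (n + 1), (s2 + n * xbar ** 2) / (n + 2))
        total_mre += (d_mre - tau ** 2) ** 2 / tau ** 4
        total_stein += (d_stein - tau ** 2) ** 2 / tau ** 4
    return total_mre / reps, total_stein / reps
```

The gain is largest near $\xi = 0$ and is modest in size. As $|\xi|/\tau$ grows, the second term of the minimum is rarely the smaller one and the two estimators almost always coincide.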


In the location-scale family (3.23), we have so far considered only the estimation
of $\tau^r$; let us now take up the problem of estimating the location parameter $\xi$. The
transformations (3.24) relating to the sample space and parameter space remain
the same, but the transformations of the decision space now become $d' = a + bd$.
A loss function $L(\xi, \tau; d)$ is invariant under these transformations if and only if it
is of the form

(3.34) $L(\xi, \tau; d) = \rho\left(\frac{d - \xi}{\tau}\right).$

That any such loss function is invariant is obvious. Conversely, suppose that $L$
is invariant and that $(\xi, \tau; d)$ and $(\xi', \tau'; d')$ are two points with
$(d' - \xi')/\tau' = (d - \xi)/\tau$. Putting $b = \tau'/\tau$ and $\xi' - a = b\xi$, one has $d' = a + bd$, $\xi' = a + b\xi$,
and $\tau' = b\tau$, hence $L(\xi', \tau'; d') = L(\xi, \tau; d)$, as was to be proved.

Equivariance in the present case becomes

(3.35) $\delta(a + bx) = a + b\delta(x), \quad b > 0.$

Since $\bar G$ is transitive over the parameter space, the risk of any equivariant estimator
is constant so that an MRE estimator can be expected to exist. In some special cases,
the MRE estimator reduces to that derived in Section 1 with $\tau$ known, as follows.
For fixed $\tau$, write

(3.36) $g_\tau(x_1, \ldots, x_n) = \frac{1}{\tau^n} f\left(\frac{x_1}{\tau}, \ldots, \frac{x_n}{\tau}\right)$

so that (3.23) becomes

(3.37) $g_\tau(x_1 - \xi, \ldots, x_n - \xi).$

Lemma 3.14 Suppose that for the location family (3.37) and loss function (3.34),
there exists an MRE estimator $\delta^*$ of $\xi$ with respect to the transformations (1.10)
and (1.11) and that

(a) $\delta^*$ is independent of $\tau$, and

(b) $\delta^*$ satisfies (3.35).

Then $\delta^*$ minimizes the risk among all estimators satisfying (3.35).

Proof. Suppose $\delta$ is any other estimator which satisfies (3.35) and hence, a fortiori,
is equivariant with respect to the transformations (1.10) and (1.11), and that the
value $\tau$ of the scale parameter is known. It follows from the assumptions about $\delta^*$
that for this $\tau$, the risk of $\delta^*$ does not exceed the risk of $\delta$. Since this is true for all
values of $\tau$, the result follows. □

Example 3.15 MRE for normal mean. Let $X_1, \ldots, X_n$ be iid as $N(\xi, \tau^2)$, both
parameters being unknown. Then, it follows from Example 1.15 that $\delta^* = \bar X$ for
any loss function $\rho[(d - \xi)/\tau]$ for which $\rho$ satisfies the assumptions of Example
1.15. Since (a) and (b) of Lemma 3.14 hold for this $\delta^*$, it is the MRE estimator of
$\xi$ under the transformations (3.24). ‖

Example 3.16 Uniform location parameter. Let $X_1, \ldots, X_n$ be iid as
$U(\xi - \tau/2, \xi + \tau/2)$. Then, analogous to Example 3.15, it follows from Example 1.19 that
$[X_{(1)} + X_{(n)}]/2$ is MRE for the loss functions of Example 3.15. ‖

Unfortunately, the MRE estimators of Section 1 typically do not satisfy the 
assumptions of Lemma 3.14. This is the case, for instance, with the estimators of 
Examples 1.18 and 1.22. To derive the MRE estimator without these assumptions, 
let us first characterize the totality of equivariant estimators. 

Theorem 3.17 Let $\delta_0$ be any estimator of $\xi$ satisfying (3.35) and $\delta_1$ any estimator of
$\tau$ taking on positive values only and satisfying

(3.38) $\delta_1(a + bx) = b\delta_1(x)$ for all $b > 0$ and all $a$.

Then, $\delta$ satisfies (3.35) if and only if it is of the form

(3.39) $\delta(x) = \delta_0(x) - w(z)\delta_1(x)$

where $z$ is given by (3.30).

Proof. Analogous to Lemma 1.6, it is seen that $\delta$ satisfies (3.35) if and only if it
is of the form

(3.40) $\delta(x) = \delta_0(x) - u(x)\delta_1(x),$

where

(3.41) $u(a + bx) = u(x)$ for all $b > 0$ and all $a$

(Problem 3.26). That (3.41) holds if and only if $u$ depends on $x$ only through $z$
follows from Lemma 1.7 and Theorem 3.1.

An argument paralleling that of Theorem 1.10 now shows that the MRE estimator
of $\xi$ is

$\delta^*(X) = \delta_0(X) - w^*(Z)\delta_1(X)$

where for each $z$, $w^*(z)$ is any number minimizing

(3.42) $E_{0,1}\{\rho[\delta_0(X) - w^*(z)\delta_1(X)] \mid z\}.$

Here, $E_{0,1}$ indicates that the expectation is evaluated at $\xi = 0$, $\tau = 1$.

If, in particular,

(3.43) $\rho\left(\frac{d - \xi}{\tau}\right) = \frac{(d - \xi)^2}{\tau^2},$

it is easily seen that $w^*(z)$ is

(3.44) $w^*(z) = E_{0,1}[\delta_0(X)\delta_1(X) \mid z]/E_{0,1}[\delta_1^2(X) \mid z].$ □


Example 3.18 Exponential. Let $X_1, \ldots, X_n$ be iid according to the exponential
distribution $E(\xi, \tau)$. If $\delta_0(X) = X_{(1)}$ and $\delta_1(X) = \sum[X_i - X_{(1)}]$, it follows from
Example 1.6.24 that $(\delta_0, \delta_1)$ are jointly independent of $Z$ and are also independent
of each other. Then (Problem 3.25),

$w^*(z) = w^* = \frac{E[\delta_0(X)\delta_1(X)]}{E[\delta_1^2(X)]} = \frac{1}{n^2},$

and the MRE estimator of $\xi$ is therefore

$\delta^*(X) = X_{(1)} - \frac{1}{n^2}\sum[X_i - X_{(1)}].$

When the best location equivariant estimate is not also scale equivariant, its risk
is, of course, smaller than that of the MRE under (3.35). Some numerical values
of the increase that results from the additional requirement are given for a number
of situations by Hoaglin (1975). ‖


For the loss function (3.43), no risk-unbiased estimator $\delta$ exists, since this would
require that for all $\xi$, $\xi'$, $\tau$, and $\tau'$

(3.45) $\frac{1}{\tau^2} E_{\xi,\tau}[\delta(X) - \xi]^2 \le \frac{1}{\tau'^2} E_{\xi,\tau}[\delta(X) - \xi']^2,$

which is clearly impossible. Perhaps (3.45) is too strong and should be required
only when $\tau' = \tau$. It then reduces to (1.32) with $\theta = (\xi, \tau)$ and $g(\theta) = \xi$,
and this weakened form of (3.45) reduces to the classical unbiasedness condition
$E_{\xi,\tau}[\delta(X)] = \xi$. A UMVU estimator of $\xi$ exists in Example 3.18 (Problem
2.2.18), but it is

$\delta(X) = X_{(1)} - \frac{1}{n(n - 1)}\sum[X_i - X_{(1)}]$

rather than $\delta^*(X)$, and the latter is not unbiased (Problem 3.27).
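The constants in Example 3.18 and the bias comparison above can be reproduced from a few moments. The sketch below relies on standard facts that are not derived in the text: at $\xi = 0$, $\tau = 1$ the minimum $X_{(1)}$ is exponential with mean $1/n$, and $\sum[X_i - X_{(1)}]$ has a gamma distribution with shape $n - 1$ and unit scale, independent of $X_{(1)}$. The function name is illustrative.

```python
from fractions import Fraction

def exponential_constants(n):
    """Return (w*, bias of the MRE estimator, bias of the UMVU estimator) at xi = 0, tau = 1."""
    e_d0 = Fraction(1, n)               # E[X_(1)]
    e_d1 = Fraction(n - 1)              # E[sum(X_i - X_(1))], mean of Gamma(n - 1, 1)
    e_d1_sq = Fraction(n * (n - 1))     # variance + mean^2 = (n - 1) + (n - 1)^2
    w_star = e_d0 * e_d1 / e_d1_sq      # (3.44), using independence of delta_0 and delta_1
    bias_mre = e_d0 - w_star * e_d1     # E[delta*] - xi
    bias_umvu = e_d0 - e_d1 / (n * (n - 1))
    return w_star, bias_mre, bias_umvu
```

The MRE estimator has bias $\tau/n^2$ (here with $\tau = 1$), whereas the divisor $n(n - 1)$ removes the bias exactly.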

4 Normal Linear Models 

Having developed the theory of unbiased estimation in Chapter 2 and of equivariant 
estimation in the first three sections of the present chapter, we shall now apply these 
results to some important classes of statistical models. One of the most widely used 
bodies of statistical techniques, comprising particularly the analysis of variance, 
regression, and the analysis of covariance, is formalized in terms of linear models, 
which will be defined and illustrated in the following. The examples, however, 
are not enough to give an idea of the full richness of the applications. For a more 
complete treatment, see, for example, the classic book by Scheffe (1959), or Seber 
(1977), Arnold (1981), Searle (1987), or Christensen (1987). 

Consider the problem of investigating the effect of a number of different factors 
on a response. Typically, each factor can occur in a number of different forms or at 
a number of different levels. Factor levels can be qualitative or quantitative. Three 
possibilities arise, corresponding to three broad categories of linear models: 

(a) All factor levels qualitative. 

(b) All factor levels quantitative. 

(c) Some factors of each kind. 

Example 4.1 One-way layout. A simple illustration of category (a) is provided 
by the one-way layout in which a single factor occurs at a number of qualitatively 
different levels. For example, we may wish to study the effect on performance of 
a number of different textbooks or the effect on weight loss of a number of diets. 
If $X_{ij}$ denotes the response of the $j$th subject receiving treatment $i$, it is often
reasonable to assume that the $X_{ij}$ are independently distributed as

(4.1) $X_{ij} : N(\xi_i, \sigma^2), \quad j = 1, \ldots, n_i; \quad i = 1, \ldots, s.$

Estimands that may be of interest are $\xi_i$ and $\xi_i - (1/s)\sum_{j=1}^s \xi_j$. ‖

Example 4.2 A simple regression model. As an example of type (b), consider 
the time required to memorize a list of words. If the number of words presented to
the $i$th subject and the time it takes the subject to learn the words are denoted by
$t_i$ and $X_i$, respectively, one might assume that for the range of $t$'s of interest, the
$X$'s are independently distributed as

(4.2) $X_i : N(\alpha + \beta t_i + \gamma t_i^2, \sigma^2)$

where $\alpha$, $\beta$, and $\gamma$ are the unknown regression coefficients, which are to be estimated.

This would turn into an example of the third type if there were several groups
of subjects. One might, for example, wish to distinguish between women and men
or to see how learning ability is influenced by the form of the word list (whether
it is handwritten, typed, or printed). The model might then become

(4.3) $X_{ij} : N(\alpha_i + \beta_i t_{ij} + \gamma_i t_{ij}^2, \sigma^2)$

where $X_{ij}$ is the response of the $j$th subject in the $i$th group. Here, the group is a
qualitative factor and the length of the list a quantitative one.

The general linear model, which covers all three cases, assumes that 

(4.4) $X_i$ is distributed as $N(\xi_i, \sigma^2)$, $i = 1, \ldots, n$, 

where the $X_i$ are independent and $(\xi_1, \ldots, \xi_n) \in \Pi_\Omega$, an $s$-dimensional linear 
subspace of $E_n$ ($s < n$). 

It is convenient to reduce this model to a canonical form by means of an orthogonal 
transformation 

(4.5) $Y = XC$ 

where we shall use $Y$ to denote both the vector with components $(Y_1, \ldots, Y_n)$ and 
the row matrix $(Y_1 \cdots Y_n)$. If $\eta_i = E(Y_i)$, the $\eta$'s and $\xi$'s are related by 

(4.6) $\eta = \xi C$ 

where $\eta = (\eta_1, \ldots, \eta_n)$ and $\xi = (\xi_1, \ldots, \xi_n)$. 

To find the distribution of the $Y$'s, note that the joint density of $X_1, \ldots, X_n$ is 

$\frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \xi_i)^2\right]$, 

that 

$\sum (x_i - \xi_i)^2 = \sum (y_i - \eta_i)^2$, 

since $C$ is orthogonal, and that the Jacobian of the transformation is 1. Hence, the 
joint density of $Y_1, \ldots, Y_n$ is 

$\frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{1}{2\sigma^2} \sum (y_i - \eta_i)^2\right]$. 

The $Y$'s are therefore independent normal with $Y_i \sim N(\eta_i, \sigma^2)$, $i = 1, \ldots, n$. If 
$c_i$ denotes the $i$th column of $C$, the desired form is obtained by choosing the $c_i$ so 
that the first $s$ columns $c_1, \ldots, c_s$ span $\Pi_\Omega$. Then, 

$\xi \in \Pi_\Omega \iff \xi$ is orthogonal to the last $n - s$ columns of $C$. 

Since $\eta = \xi C$, it follows that 

(4.7) $\xi \in \Pi_\Omega \iff \eta_{s+1} = \cdots = \eta_n = 0$. 

In terms of the $Y$'s, the model (4.4) thus becomes 

(4.8) $Y_i : N(\eta_i, \sigma^2)$, $i = 1, \ldots, s$, and $Y_j : N(0, \sigma^2)$, $j = s+1, \ldots, n$. 

As $(\xi_1, \ldots, \xi_n)$ varies over $\Pi_\Omega$, $(\eta_1, \ldots, \eta_s)$ varies unrestrictedly over $E_s$ while 
$\eta_{s+1} = \cdots = \eta_n = 0$. 

In this canonical model, $Y_1, \ldots, Y_s$ and $S^2 = \sum_{j=s+1}^{n} Y_j^2$ are complete sufficient 
statistics for $(\eta_1, \ldots, \eta_s, \sigma^2)$. 

Theorem 4.3 

(a) The UMVU estimators of $\sum_{i=1}^{s} \lambda_i \eta_i$ (where the $\lambda$'s are known constants) and 
$\sigma^2$ are $\sum_{i=1}^{s} \lambda_i Y_i$ and $S^2/(n - s)$, respectively. (Here, UMVU is used in the 
strong sense of Section 2.1.) 

(b) Under the transformations 

$Y_i' = Y_i + a_i$ ($i = 1, \ldots, s$); $Y_j' = Y_j$ ($j = s + 1, \ldots, n$) 

$\eta_i' = \eta_i + a_i$ ($i = 1, \ldots, s$); and $d' = d + \sum_{i=1}^{s} a_i \lambda_i$ 

and with loss function $L(\eta, d) = \rho(d - \sum \lambda_i \eta_i)$ where $\rho$ is convex and even, 
the UMVU estimator $\sum_{i=1}^{s} \lambda_i Y_i$ is also the MRE estimator of $\sum_{i=1}^{s} \lambda_i \eta_i$. 

(c) Under the loss function $(d - \sigma^2)^2/\sigma^4$, the MRE estimator of $\sigma^2$ is $S^2/(n - s + 2)$. 

Proof. 

(a) Since $\sum_{i=1}^{s} \lambda_i Y_i$ and $S^2/(n - s)$ are unbiased and are functions of the complete 
sufficient statistics, they are UMVU. 

(b) The condition of equivariance is that 

$\delta(Y_1 + c_1, \ldots, Y_s + c_s, Y_{s+1}, \ldots, Y_n) = \delta(Y_1, \ldots, Y_s, Y_{s+1}, \ldots, Y_n) + \sum_{i=1}^{s} \lambda_i c_i$ 

and the result follows from Problem 2.27. 

(c) This follows essentially from Example 3.7 (see Problem 4.3). 

□ 
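The constant $n - s + 2$ in part (c) can be checked by a direct calculation among multiples of $S^2$. The following is only a sketch, not the equivariance argument of Example 3.7: it assumes that $S^2/\sigma^2$ has a $\chi^2$ distribution with $m = n - s$ degrees of freedom, so that $E(S^2) = m\sigma^2$ and $E(S^4) = m(m+2)\sigma^4$, and it compares only estimators of the form $cS^2$ (the symbols $c$ and $m$ are introduced here for illustration).

```latex
\begin{align*}
E\left[\frac{(cS^2 - \sigma^2)^2}{\sigma^4}\right]
  &= c^2\, m(m+2) - 2cm + 1, \qquad m = n - s, \\
\frac{d}{dc}\left[c^2\, m(m+2) - 2cm + 1\right] = 0
  &\iff c = \frac{1}{m+2} = \frac{1}{n - s + 2}.
\end{align*}
```

Substituting back, the unbiased choice $c = 1/m$ has risk $2/m$, whereas $c = 1/(m+2)$ has the smaller risk $2/(m+2)$.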

It would be more convenient to have the estimator expressed in terms of the 
original variables $X_1, \ldots, X_n$, rather than the transformed variables $Y_1, \ldots, Y_n$. 
For this purpose, we introduce the following definition. 

Let $\xi = (\xi_1, \ldots, \xi_n)$ be any vector in $\Pi_\Omega$. Then, the least squares estimators 
(LSE) $(\hat\xi_1, \ldots, \hat\xi_n)$ of $(\xi_1, \ldots, \xi_n)$ are those estimators which minimize 
$\sum_{i=1}^{n} (X_i - \xi_i)^2$ subject to the condition $\xi \in \Pi_\Omega$. 

Theorem 4.4 Under the model (4.4), the UMVU estimator of $\sum_{i=1}^{n} \gamma_i \xi_i$ is $\sum_{i=1}^{n} \gamma_i \hat\xi_i$. 
Proof. By Theorem 4.3 (and the completeness of $Y_1, \ldots, Y_s$ and $S^2$), it suffices 
to show that $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is a linear function of $Y_1, \ldots, Y_s$, and that it is unbiased for 
$\sum_{i=1}^{n} \gamma_i \xi_i$. Now, 

(4.9) $\sum_{i=1}^{n} (X_i - \xi_i)^2 = \sum_{i=1}^{n} [Y_i - E(Y_i)]^2 = \sum_{i=1}^{s} (Y_i - \eta_i)^2 + \sum_{j=s+1}^{n} Y_j^2$. 

The right side is minimized by $\eta_i = Y_i$ ($i = 1, \ldots, s$), and the left side is minimized 
by $\hat\xi_1, \ldots, \hat\xi_n$. Hence, 

$(Y_1 \cdots Y_s \; 0 \cdots 0) = (\hat\xi_1 \cdots \hat\xi_n) C = \hat\xi C$ 

so that 

$\hat\xi = (Y_1 \cdots Y_s \; 0 \cdots 0) C^{-1}$. 

It follows that each $\hat\xi_i$ and, therefore, $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is a linear function of $Y_1, \ldots, Y_s$. 
Furthermore, 

$E(\hat\xi) = E[(Y_1 \cdots Y_s \; 0 \cdots 0) C^{-1}] = (\eta_1 \cdots \eta_s \; 0 \cdots 0) C^{-1} = \xi$. 

Thus, each $\hat\xi_i$ is unbiased for $\xi_i$; consequently, $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is unbiased for $\sum_{i=1}^{n} \gamma_i \xi_i$. 

□ 

It is interesting to note that each of the two quite different equations 

$X = (Y_1 \cdots Y_n) C^{-1}$ and $\hat\xi = (Y_1 \cdots Y_s \; 0 \cdots 0) C^{-1}$ 

leads to $\xi = (\eta_1 \cdots \eta_s \; 0 \cdots 0) C^{-1}$ by taking expectations. 

Let us next reinterpret the equivariance considerations of Theorem 4.3 in terms 
of the original variables. It is necessary first to specify the group of transformations 
leaving the problem invariant. The transformations of $Y$-space defined in Theorem 
4.3(b), in terms of the $X$'s, become $X_i' = X_i + b_i$, $i = 1, \ldots, n$, but the $b_i$ are not 
arbitrary since the problem remains invariant only if $\xi' = \xi + b \in \Pi_\Omega$; that is, the $b_i$ 
must satisfy $b = (b_1, \ldots, b_n) \in \Pi_\Omega$. Theorem 4.3(b) thus becomes the following 
corollary. 

Corollary 4.5 Under the transformations 

(4.10) $X' = X + b$ with $b \in \Pi_\Omega$, 

$\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is MRE for estimating $\sum_{i=1}^{n} \gamma_i \xi_i$ with the loss function $\rho(d - \sum \gamma_i \xi_i)$ 
provided $\rho$ is convex and even. 

To obtain the UMVU and MRE estimators of $\sigma^2$ in terms of the $X$'s, it is only 
necessary to reexpress $S^2$. From the minimization of the two sides of (4.9), it is 
seen that 

(4.11) $\sum_{i=1}^{n} (X_i - \hat\xi_i)^2 = \sum_{j=s+1}^{n} Y_j^2 = S^2$. 

The UMVU and MRE estimators of $\sigma^2$ given in Theorem 4.3, in terms of the $X$'s, 
are therefore $\sum (X_i - \hat\xi_i)^2/(n - s)$ and $\sum (X_i - \hat\xi_i)^2/(n - s + 2)$, respectively. 
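The identity (4.11), which equates the residual sum of squares with the sum of the last $n - s$ squared canonical variables, can be checked numerically in the simplest case. The sketch below assumes the one-sample model in which all means are equal (so $s = 1$ and the subspace is spanned by $(1, \ldots, 1)$), and uses a Helmert-type matrix as one convenient choice of the orthogonal matrix $C$. The function name `helmert_columns` and the data are illustrative assumptions, not part of the text.

```python
import math

def helmert_columns(n):
    """Columns c_1, ..., c_n of an orthogonal n x n matrix C (Helmert form).

    c_1 is proportional to (1, ..., 1), so it spans the subspace of
    constant mean vectors (the case s = 1).
    """
    cols = [[1.0 / math.sqrt(n)] * n]
    for k in range(1, n):
        norm = math.sqrt(k * (k + 1))
        # k entries equal to 1/norm, then -k/norm, then zeros
        cols.append([1.0 / norm] * k + [-k / norm] + [0.0] * (n - k - 1))
    return cols

x = [2.0, 5.0, 3.0, 6.0]            # illustrative data, not from the text
n, s = len(x), 1
cols = helmert_columns(n)

# Y = XC: the j-th component of Y is the inner product of X with column c_j.
y = [sum(xi * ci for xi, ci in zip(x, col)) for col in cols]

x_bar = sum(x) / n                              # least squares estimate of each xi_i
rss = sum((xi - x_bar) ** 2 for xi in x)        # left side of (4.11)
s2 = sum(yj ** 2 for yj in y[s:])               # right side of (4.11)

umvu_var = s2 / (n - s)             # UMVU estimator of sigma^2
mre_var = s2 / (n - s + 2)          # MRE estimator under loss (d - sigma^2)^2 / sigma^4
```

Here the first canonical variable is $\sqrt{n}$ times the sample mean, and the remaining three carry exactly the residual sum of squares.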

Let us now illustrate these results. 

Example 4.6 Continuation of Example 4.1. Let $X_{ij}$ be independent $N(\xi_i, \sigma^2)$, 
$j = 1, \ldots, n_i$, $i = 1, \ldots, s$. To find the UMVU or MRE estimator of a linear 
function of the $\xi_i$, it is only necessary to find the least squares estimators $\hat\xi_i$. 
Minimizing 

$\sum_{i=1}^{s} \sum_{j=1}^{n_i} (X_{ij} - \xi_i)^2 = \sum_{i=1}^{s} \left[ \sum_{j=1}^{n_i} (X_{ij} - X_{i\cdot})^2 + n_i (X_{i\cdot} - \xi_i)^2 \right]$, 

we see that 

$\hat\xi_i = X_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}$. 

From (4.11), the UMVU estimator of $\sigma^2$ in the present case is seen to be 

$\hat\sigma^2 = \sum_{i=1}^{s} \sum_{j=1}^{n_i} (X_{ij} - X_{i\cdot})^2 / (\sum n_i - s)$. 
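As a concrete illustration of Example 4.6, the following sketch computes the group means (the least squares estimates of the $\xi_i$) and the pooled within-group variance with divisor $\sum n_i - s$. The function name `one_way_estimates` and the data are hypothetical.

```python
def one_way_estimates(groups):
    """Least squares estimates of the group means and the UMVU estimate of sigma^2.

    groups[i] holds the observations X_i1, ..., X_in_i of treatment i.
    """
    means = [sum(g) / len(g) for g in groups]               # xi_hat_i = X_i.
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g)  # within-group sum of squares
    total_n = sum(len(g) for g in groups)
    s = len(groups)
    return means, within / (total_n - s)                     # divisor is sum(n_i) - s

groups = [[4.0, 6.0], [1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0]]   # made-up data
means, sigma2_hat = one_way_estimates(groups)
```

The sample sizes need not be equal; only the divisor changes with the total number of observations.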


Example 4.7 Simple linear regression. Let $X_i$ be independent $N(\xi_i, \sigma^2)$, 
$i = 1, \ldots, n$, with $\xi_i = \alpha + \beta t_i$, $t_i$ known and not all equal. Here, $\Pi_\Omega$ is spanned by 
the vectors $(1, \ldots, 1)$ and $(t_1, \ldots, t_n)$ so that the dimension of $\Pi_\Omega$ is $s = 2$. The 
least squares estimators of $\xi_i$ are obtained by minimizing $\sum_{i=1}^{n} (X_i - \alpha - \beta t_i)^2$ 
with respect to $\alpha$ and $\beta$. It is easily seen that for any $i$ and $j$ with $t_i \neq t_j$, 

(4.12) $\beta = \frac{\xi_j - \xi_i}{t_j - t_i}$, $\alpha = \frac{t_j \xi_i - t_i \xi_j}{t_j - t_i}$ 

and that $\hat\beta$ and $\hat\alpha$ are given by the same functions of $\hat\xi_i$ and $\hat\xi_j$ (Problem 4.4). Hence, 
$\hat\alpha$ and $\hat\beta$ are the best unbiased and equivariant estimators of $\alpha$ and $\beta$, respectively. 

Note that the representation of $\alpha$ and $\beta$ in terms of the $\xi_i$'s is not unique. Any 
two $\xi_i$ and $\xi_j$ values with $t_i \neq t_j$ determine $\alpha$ and $\beta$ and thus all the $\xi$'s. The reason, 
of course, is that the vectors $(\xi_1, \ldots, \xi_n)$ lie in a two-dimensional linear subspace 
of $n$-space. || 
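The non-uniqueness noted in Example 4.7 can be seen numerically: once the line has been fitted, any two fitted values $\hat\xi_i$, $\hat\xi_j$ with $t_i \neq t_j$ return the same $\hat\alpha$ and $\hat\beta$ through $\beta = (\xi_j - \xi_i)/(t_j - t_i)$ and $\alpha = (t_j \xi_i - t_i \xi_j)/(t_j - t_i)$. The sketch below uses the usual closed-form least squares solution; the function names and the data are illustrative assumptions.

```python
def fit_line(t, x):
    """Least squares estimates (alpha_hat, beta_hat) for xi_i = alpha + beta * t_i."""
    n = len(t)
    t_bar = sum(t) / n
    x_bar = sum(x) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)    # positive because the t_i are not all equal
    sxy = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    beta_hat = sxy / sxx
    alpha_hat = x_bar - beta_hat * t_bar
    return alpha_hat, beta_hat

t = [0.0, 1.0, 2.0, 3.0]            # illustrative design points
x = [1.0, 3.0, 2.0, 6.0]            # illustrative responses
alpha_hat, beta_hat = fit_line(t, x)
xi_hat = [alpha_hat + beta_hat * ti for ti in t]    # fitted means

def from_pair(i, j):
    """Recover (alpha_hat, beta_hat) from two fitted values with t[i] != t[j]."""
    b = (xi_hat[j] - xi_hat[i]) / (t[j] - t[i])
    a = (t[j] * xi_hat[i] - t[i] * xi_hat[j]) / (t[j] - t[i])
    return a, b
```

For instance, `from_pair(0, 1)` and `from_pair(1, 3)` agree with `(alpha_hat, beta_hat)` up to rounding error.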


Example 4.7 is a special case of the model specified by the equation 

(4.13) $\xi = \theta A$ 

where $\theta = (\theta_1 \cdots \theta_s)$ are $s$ unknown parameters and $A$ is a known $s \times n$ matrix of 
rank $s$, the so-called full-rank model. In Example 4.7, 

$\theta = (\alpha, \beta)$ and $A = \begin{pmatrix} 1 & \cdots & 1 \\ t_1 & \cdots & t_n \end{pmatrix}$. 

The least squares estimators of the $\xi_i$ in (4.13) are obtained by minimizing 

$\sum_{i=1}^{n} [X_i - \xi_i(\theta)]^2$ 

with respect to $\theta$. The minimizing values $\hat\theta_i$ are the LSEs of $\theta_i$, and the LSEs of 
the $\xi_i$ are given by 

(4.14) $\hat\xi = \hat\theta A$. 

Theorems 4.3 and 4.4 establish that the various optimality results apply to the 
estimators of the $\xi_i$ and their linear combinations. The following theorem shows 
that they also apply to the estimators of the $\theta$'s and their linear functions. 

Theorem 4.8 Let $X_i \sim N(\xi_i, \sigma^2)$, $i = 1, \ldots, n$, be independent, and let $\xi$ satisfy 
(4.13) with $A$ of rank $s$. Then, the least squares estimator $\hat\theta$ of $\theta$ is a linear function 
of the $\hat\xi_i$ and hence has the optimality properties established in Theorems 4.3 and 
4.4 and Corollary 4.5. 

Proof. It need only be shown that $\theta$ is a linear function of $\xi$; then, by (4.13) and 
(4.14), $\hat\theta$ is the corresponding linear function of $\hat\xi$. 

Assume without loss of generality that the first $s$ columns of $A$ are linearly 
independent, and form the corresponding nonsingular $s \times s$ submatrix $A^*$. Then, 

$(\xi_1 \cdots \xi_s) = (\theta_1 \cdots \theta_s) A^*$, 

so that 

$(\theta_1 \cdots \theta_s) = (\xi_1 \cdots \xi_s) A^{*-1}$, 

and this completes the proof. □ 

Typical examples in which $\xi$ is given in terms of (4.13) are polynomial regressions 
such as 

$\xi_i = \alpha + \beta t_i + \gamma t_i^2$ 

or regression in more than one variable such as 

$\xi_i = \alpha + \beta t_i + \gamma u_i$ 

where the $t$'s and $u$'s are given, and $\alpha$, $\beta$, and $\gamma$ are the unknown parameters. Or 
there might be several regression lines with a common slope, say 

$\xi_{ij} = \alpha_i + \beta t_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, a$), 

and so on. 

The full-rank model does not always provide the most convenient parametrization; 
for reasons of symmetry, it is often preferable to use the model (4.13) with 
more parameters than are needed. Before discussing such models more fully, let 
us illustrate the resulting difficulties on a trivial example. Suppose that $\xi_i = \xi$ for 
all $i$ and that we put $\xi_i = \lambda + \mu$. Such a model does not define $\lambda$ and $\mu$ uniquely but 
only their sum. One can then either let this ambiguity remain but restrict attention 
to clearly defined functions such as $\lambda + \mu$, or one can remove the ambiguity by 
placing an additional restriction on $\lambda$ and $\mu$, such as $\mu - \lambda = 0$, $\mu = 0$, or $\lambda = 0$. 
More generally, let us suppose that the model is given by 

(4.15) $\xi = \theta A$ 

where $A$ is a $t \times n$ matrix of rank $s < t$. To define the $\theta$'s uniquely, (4.15) is 
supplemented by side conditions 

(4.16) $\theta B = 0$ 

chosen so that the set of equations (4.15) and (4.16) has a unique solution $\theta$ for 
every $\xi \in \Pi_\Omega$. 

Example 4.9 Unbalanced one-way layout. Consider the one-way layout of Example 4.1, 
with $X_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, s$) independent normal variables 
with means $\xi_i$ and variance $\sigma^2$. When the principal concern is a comparison of the 
$s$ treatments or populations, one is interested in the differences of the $\xi$'s and may 
represent these by means of the differences between the $\xi_i$ and some mean value 
$\mu$, say $\alpha_i = \xi_i - \mu$. The model then becomes 

(4.17) $\xi_i = \mu + \alpha_i$, $i = 1, \ldots, s$, 

which expresses the $s$ $\xi$'s in terms of $s + 1$ parameters. To specify the parameters, 
an additional restriction is required, for example, 

(4.18) $\sum \alpha_i = 0$. 

Adding the $s$ equations (4.17) and using (4.18), one finds 

(4.19) $\mu = \frac{\sum \xi_i}{s} = \bar\xi$ 

and hence 

(4.20) $\alpha_i = \xi_i - \bar\xi$. 

The quantity $\alpha_i$ measures the effect of the $i$th treatment. Since $X_{i\cdot}$ is the least 
squares estimator of $\xi_i$, the UMVU estimators of $\mu$ and the $\alpha$'s are 

(4.21) $\hat\mu = \frac{\sum X_{i\cdot}}{s} = \sum\sum \frac{X_{ij}}{s n_i}$ and $\hat\alpha_i = X_{i\cdot} - \hat\mu$. 

When the sample sizes $n_i$ are not all equal, a possible disadvantage of this 
representation is that the vectors of the coefficients of the $X_{ij}$ in the $\hat\alpha_i$ are not 
orthogonal to the corresponding vector of coefficients of $\hat\mu$ [Problem 4.7(a)]. As a 
result, $\hat\mu$ is not independent of the $\hat\alpha_i$. Also, when the $\alpha_i$ are known to be zero, the 
estimator of $\mu$ is no longer given by (4.21) (Problem 4.8). 

For these reasons, the side condition (4.18) is sometimes replaced by 

(4.22) $\sum n_i \alpha_i = 0$, 

which leads to 

(4.23) $\mu = \frac{\sum n_i \xi_i}{N} = \tilde\xi$ ($N = \sum n_i$) 

and hence 

(4.24) $\alpha_i = \xi_i - \tilde\xi$. 

Although the $\alpha_i$ of (4.22) seems to be a less natural measure of the effect of the $i$th 
treatment, the resulting UMVU estimators $\hat\alpha_i$ and $\hat\mu$ have the orthogonality property 
not possessed by the estimators (4.21) [Problem 4.7(b)]. The side conditions (4.18) 
and (4.22), of course, agree when the $n_i$ are all equal. || 
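The orthogonality contrast between the two side conditions can be verified by writing out the coefficient vectors. The sketch below assumes that under $\sum \alpha_i = 0$ the estimator of $\mu$ is the unweighted mean of the group means, and that under $\sum n_i \alpha_i = 0$ it is the grand mean of all observations; in both cases $\hat\alpha_i$ is the $i$th group mean minus $\hat\mu$. The function names and the sample sizes are hypothetical.

```python
def coefficient_vectors(sizes, weighted):
    """Coefficients of the X_ij in mu_hat and in each alpha_hat_i.

    sizes[i] is n_i. With weighted=False the side condition is sum(alpha_i) = 0;
    with weighted=True it is sum(n_i * alpha_i) = 0.
    Observations are ordered group by group.
    """
    s, total = len(sizes), sum(sizes)
    labels = [i for i, n_i in enumerate(sizes) for _ in range(n_i)]
    if weighted:
        mu = [1.0 / total for _ in labels]               # grand mean of all X_ij
    else:
        mu = [1.0 / (s * sizes[g]) for g in labels]      # mean of the group means
    alphas = []
    for i in range(s):
        group_mean = [1.0 / sizes[i] if g == i else 0.0 for g in labels]
        alphas.append([a - m for a, m in zip(group_mean, mu)])
    return mu, alphas

def dot(u, v):
    """Inner product of two coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))

sizes = [1, 2, 5]                                        # hypothetical unequal n_i
mu_u, alphas_u = coefficient_vectors(sizes, weighted=False)
mu_w, alphas_w = coefficient_vectors(sizes, weighted=True)
inner_unweighted = [dot(mu_u, a) for a in alphas_u]      # not all zero
inner_weighted = [dot(mu_w, a) for a in alphas_w]        # all zero up to rounding
```

Since the observations are uncorrelated with common variance, a zero inner product of coefficient vectors means zero covariance, and under normality this gives independence.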

The following theorem shows that the conclusion of Theorem 4.8 continues to 
hold when the $\theta$'s are defined by (4.15) and (4.16) instead of (4.13). 

Theorem 4.10 Let $X_i$ be independent $N(\xi_i, \sigma^2)$, $i = 1, \ldots, n$, with $\xi \in \Pi_\Omega$, 
an $s$-dimensional linear subspace of $E_n$. Suppose that $(\theta_1, \ldots, \theta_t)$ are uniquely 
determined by (4.15) and (4.16), where $A$ is of rank $s < t$ and $B$ of rank $k$. Then, 
$k = t - s$, and the optimality results of Theorem 4.4 and Corollary 4.5 apply to the 
parameters $\theta_1, \ldots, \theta_t$ and their least squares estimators $\hat\theta_1, \ldots, \hat\theta_t$. 

Proof. Let $\hat\theta_1, \ldots, \hat\theta_t$ be the LSEs of $\theta_1, \ldots, \theta_t$, that is, the values that minimize 

$\sum_{i=1}^{n} [X_i - \xi_i(\theta)]^2$ 

subject to (4.15) and (4.16). It must be shown, as in the proof of Theorem 4.8, that 
the $\hat\theta_i$'s are linear functions of $\hat\xi_1, \ldots, \hat\xi_n$, and that the $\theta_i$'s are the same functions 
of $\xi_1, \ldots, \xi_n$. 

Without loss of generality, suppose that the $\theta$'s are numbered so that the last $k$ 
columns of $B$ are linearly independent. Then, one can solve for $\theta_{t-k+1}, \ldots, \theta_t$ in 
terms of $\theta_1, \ldots, \theta_{t-k}$, obtaining the unique solution 

(4.25) $\theta_j = L_j(\theta_1, \ldots, \theta_{t-k})$ for $j = t - k + 1, \ldots, t$. 

Substituting into $\xi = \theta A$ gives 

$\xi = (\theta_1 \cdots \theta_{t-k}) A^*$ 

for some matrix $A^*$, with $(\theta_1, \ldots, \theta_{t-k})$ varying freely in $E_{t-k}$. Since each $\xi \in \Pi_\Omega$ 
uniquely determines $\theta$, in particular the value $\xi = 0$ has the unique solution $\theta = 0$, 
so that $(\theta_1 \cdots \theta_{t-k}) A^* = 0$ has a unique solution. This implies that $A^*$ has rank 
$t - k$. On the other hand, since $\xi$ ranges over a linear space of dimension $s$, it 
follows that $t - k = s$ and, hence, that $k = t - s$. 

The situation is now reduced to that of Theorem 4.8 with $\xi$ a linear function of 
$t - k = s$ freely varying $\theta$'s, so the earlier result applies to $\theta_1, \ldots, \theta_{t-k}$. Finally, 
the remaining parameters $\theta_{t-k+1}, \ldots, \theta_t$ and their LSEs are determined by (4.25), 
and this completes the proof. □ 

Example 4.11 Two-way layout. A typical illustration of the above approach is 
provided by a two-way layout. This arises in the investigation of the effect of two 
factors on a response. In a medical situation, for example, one of the factors might 
be the kind of treatment (e.g., surgical, nonsurgical, or no treatment at all), the 
other the severity of the disease. Let $X_{ijk}$ denote the response of the $k$th subject to 
which factor 1 is applied at level $i$ and factor 2 at level $j$. We assume that the $X_{ijk}$ 
are independently, normally distributed with means $\xi_{ij}$ and common variance $\sigma^2$. 
To avoid the complications of Example 4.9, we shall suppose that each treatment 
combination $(i, j)$ is applied to the same number of subjects. If the number of 
levels of the two factors is $a$ and $b$, respectively, the model is thus 

(4.26) $X_{ijk} : N(\xi_{ij}, \sigma^2)$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $k = 1, \ldots, m$. 

This model is frequently parametrized by 

(4.27) $\xi_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$ 

with the side conditions 

(4.28) $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$. 

It is easily seen that (4.27) and (4.28) uniquely determine $\mu$ and the $\alpha$'s, $\beta$'s, 
and $\gamma$'s. Using a dot to denote averaging over the indicated subscript, we find by 
averaging (4.27) over both $i$ and $j$ and separately over $i$ and over $j$ that 

$\xi_{\cdot\cdot} = \mu$, $\xi_{i\cdot} = \mu + \alpha_i$, $\xi_{\cdot j} = \mu + \beta_j$ 

and hence that 

(4.29) $\mu = \xi_{\cdot\cdot}$, $\alpha_i = \xi_{i\cdot} - \xi_{\cdot\cdot}$, $\beta_j = \xi_{\cdot j} - \xi_{\cdot\cdot}$, 

and 

(4.30) $\gamma_{ij} = \xi_{ij} - \xi_{i\cdot} - \xi_{\cdot j} + \xi_{\cdot\cdot}$. 

Thus, $\alpha_i$ is the average effect (averaged over the levels of the second factor) of 
the first factor at level $i$, and $\beta_j$ is the corresponding effect of the second factor at 
level $j$. The quantity $\gamma_{ij}$ can be written as 

(4.31) $\gamma_{ij} = (\xi_{ij} - \xi_{\cdot\cdot}) - [(\xi_{i\cdot} - \xi_{\cdot\cdot}) + (\xi_{\cdot j} - \xi_{\cdot\cdot})]$. 

It is therefore the difference between the joint effect of the two treatments at levels 
$i$ and $j$, respectively, and the sum of the separate effects $\alpha_i + \beta_j$. The quantity 
$\gamma_{ij}$ is called the interaction of the two factors when they are at levels $i$ and $j$, 
respectively. 

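The UMVU estimators of these various effects follow immediately from Theorem 4.3 
and Example 4.6. This example shows that the UMVU estimator of $\xi_{ij}$ is 
$X_{ij\cdot}$ and the associated estimators of the various parameters are thus 

(4.32) $\hat\mu = X_{\cdot\cdot\cdot}$, $\hat\alpha_i = X_{i\cdot\cdot} - X_{\cdot\cdot\cdot}$, $\hat\beta_j = X_{\cdot j\cdot} - X_{\cdot\cdot\cdot}$, 

and 

(4.33) $\hat\gamma_{ij} = X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot}$. 

The UMVU estimator of $\sigma^2$ is 

(4.34) $\frac{1}{ab(m - 1)} \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2$. 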
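The estimators of the two-way layout are simple averages and can be computed directly. The following sketch evaluates the grand mean, main effects, interactions, and the within-cell variance with divisor $ab(m-1)$ for a small balanced data set. The function name `two_way_estimates` and the data are illustrative assumptions.

```python
def two_way_estimates(x):
    """Estimates of mu, alpha_i, beta_j, gamma_ij and sigma^2 in a balanced two-way layout.

    x[i][j] is the list of the m observations X_ij1, ..., X_ijm in cell (i, j).
    """
    a, b, m = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / m for j in range(b)] for i in range(a)]     # X_ij.
    row = [sum(cell[i]) / b for i in range(a)]                          # X_i..
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]     # X_.j.
    grand = sum(row) / a                                                # X_...
    alpha = [r - grand for r in row]
    beta = [c - grand for c in col]
    gamma = [[cell[i][j] - row[i] - col[j] + grand for j in range(b)]
             for i in range(a)]
    within = sum((obs - cell[i][j]) ** 2
                 for i in range(a) for j in range(b) for obs in x[i][j])
    sigma2 = within / (a * b * (m - 1))
    return grand, alpha, beta, gamma, sigma2

# Made-up data: a = 2 levels of factor 1, b = 3 levels of factor 2, m = 2 replicates.
x = [[[1.0, 3.0], [4.0, 6.0], [7.0, 9.0]],
     [[2.0, 2.0], [8.0, 10.0], [3.0, 5.0]]]
mu_hat, alpha_hat, beta_hat, gamma_hat, sigma2_hat = two_way_estimates(x)
```

By construction the estimated effects satisfy the same side conditions as the parameters: the $\hat\alpha_i$ and $\hat\beta_j$ sum to zero, and so do the $\hat\gamma_{ij}$ over either index.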

These results for the two-way layout easily generalize to other factorial experiments, 
that is, experiments concerning the joint effect of several factors, provided 
the numbers of observations at the various combinations of factor levels are equal. 
Theorems 4.8 and 4.10, of course, apply without this restriction, but then the 
situation is less simple. 

Model (4.4) assumes that the random variables $X_i$ are independently normally 
distributed with common unknown variance $\sigma^2$ and means $\xi_i$, which are subject 
to certain linear restrictions. We shall now consider some models that retain the 
linear structure but drop the assumption of normality. 

(i) A very simple treatment is possible if one is willing to restrict attention to 
unbiased estimators that are linear functions of the $X_i$ and to squared error loss. 
Suppose we retain from (4.4) only the assumptions about the first and second 
moments of the $X_i$, namely 

(4.35) $E(X_i) = \xi_i$, $\xi \in \Pi_\Omega$, 
$\mathrm{var}(X_i) = \sigma^2$, $\mathrm{cov}(X_i, X_j) = 0$ for $i \neq j$. 

Thus, both the normality and independence assumptions are dropped. 

Theorem 4.12 (Gauss' Theorem on Least Squares) Under assumptions (4.35), 
$\sum_{i=1}^{n} \gamma_i \hat\xi_i$ of Theorem 4.4 is UMVU among all linear estimators of $\sum_{i=1}^{n} \gamma_i \xi_i$. 
Proof. The estimator is still unbiased, since the expectations of the $X_i$ are the same 
under (4.35) as under (4.4). Let $\sum_{i=1}^{n} c_i X_i$ be any other linear unbiased estimator 
of $\sum_{i=1}^{n} \gamma_i \xi_i$. Since $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is UMVU in the normal case, and variances of linear 
functions of the $X_i$ depend only on first and second moments, it follows that 
$\mathrm{var} \sum_{i=1}^{n} \gamma_i \hat\xi_i \le \mathrm{var} \sum_{i=1}^{n} c_i X_i$. Hence, $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is UMVU among linear unbiased 
estimators. □ 
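A small numerical check of Theorem 4.12 in the setting of Example 4.7 may help. The sketch below assumes the simple linear regression model with made-up design points and compares the least squares slope with one hypothetical competitor, the endpoint estimator $(X_n - X_1)/(t_n - t_1)$. Under (4.35) the variance of $\sum c_i X_i$ is $\sigma^2 \sum c_i^2$, so only the coefficient vectors are needed.

```python
def variance_factor(coeffs):
    """var(sum c_i X_i) / sigma^2 for uncorrelated X_i with common variance."""
    return sum(c * c for c in coeffs)

t = [0.0, 1.0, 2.0, 3.0]                        # illustrative design points
n = len(t)
t_bar = sum(t) / n
sxx = sum((ti - t_bar) ** 2 for ti in t)

# Least squares slope: beta_hat = sum_i [(t_i - t_bar) / sxx] * X_i.
lse_coeffs = [(ti - t_bar) / sxx for ti in t]

# A competing linear unbiased estimator of the slope: (X_n - X_1) / (t_n - t_1).
endpoint_coeffs = [0.0] * n
endpoint_coeffs[0] = -1.0 / (t[-1] - t[0])
endpoint_coeffs[-1] = 1.0 / (t[-1] - t[0])

def is_unbiased_for_slope(coeffs):
    """E(sum c_i X_i) = beta for all (alpha, beta) iff sum c_i = 0 and sum c_i t_i = 1."""
    return (abs(sum(coeffs)) < 1e-12
            and abs(sum(c * ti for c, ti in zip(coeffs, t)) - 1.0) < 1e-12)

lse_var = variance_factor(lse_coeffs)            # equals 1 / sxx
endpoint_var = variance_factor(endpoint_coeffs)  # equals 2 / (t_n - t_1)^2
```

Both estimators are linear and unbiased, and the least squares slope has the smaller variance factor, as the theorem requires.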

Corollary 4.13 Under the assumptions (4.35) and with squared error loss, $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ 
is MRE with respect to the transformations (4.10) among all linear equivariant 
estimators of $\sum_{i=1}^{n} \gamma_i \xi_i$. 

Proof. This follows from the argument of Lemma 1.23, since $\sum_{i=1}^{n} \gamma_i \hat\xi_i$ is UMVU 
and equivariant. □ 

Theorem 4.12, which is also called the Gauss-Markov theorem, has been extensively 
generalized (see, for example, Rao 1976, Harville 1976, 1981, Kariya 
1985). We shall consider some extensions of this theorem in the next section. On 
the other hand, the following result, due to Shaffer (1991), shows a direction in 
which the theorem does not extend. If, in (4.35), we adopt the parametrization 
$\xi = \theta A$ for some $s \times n$ matrix $A$, there are some circumstances in which it is 
reasonable to assume that $A$ also has a distribution (for example, if the data $(X, A)$ are 
obtained from a sample of units, rather than $A$ being a preset design matrix as is the 
case in many experiments). The properties of the resulting least squares estimator, 
however, will vary according to what is assumed about both the distribution of $A$ 
and the distribution of $X$. Note that in the following theorem, all expectations are 
over the joint distribution of $X$ and $A$. 

Theorem 4.14 Under assumptions (4.35), with $\xi = \theta A$, the following hold. 

(a) If $(X, A)$ are jointly multivariate normal with all parameters unknown, then 
the least squares estimator $\sum \gamma_i \hat\xi_i$ is the UMVU estimator of $\sum \gamma_i \xi_i$. 

(b) If the distribution of $A$ is unknown, then the least squares estimator $\sum \gamma_i \hat\xi_i$ is 
UMVU among all linear estimators of $\sum \gamma_i \xi_i$. 

(c) If $E(AA')$ is known, no best linear unbiased estimator of $\sum \gamma_i \xi_i$ exists. 

Proof. Part (a) follows from the fact that the least squares estimator is a function 
of the complete sufficient statistic. Part (b) can be proved by showing that if $\sum \gamma_i \hat\xi_i$ 
is unconditionally unbiased then it is conditionally unbiased, and hence Theorem 
4.12 applies. For this purpose, one can use a variation of Problem 1.6.33, where it 
was shown that the order statistics are complete sufficient. Finally, part (c) follows 
from the fact that the extra information about the variance of A can often be used 
to improve any unbiased estimator. See Problems 4.16-4.18 for details. □ 

The formulation of the regression problem in Theorem 4.14, in which the p 
rows of A are sometimes referred to as “random regressors,” has other interesting 
implications. If A is ancillary, the distribution of A and hence E(A'A) are known 
and so we have a situation where the distribution of an ancillary statistic will 
affect the properties of an estimator. This paradox was investigated by Brown 
(1990a), who established some interesting relationships between ancillarity and 
admissibility (see Problems 5.7.31 and 5.7.32). 

For estimating $\sigma^2$, it is natural to restrict attention to unbiased quadratic (rather 
than linear) estimators $Q$ of $\sigma^2$. Among these, does the estimator $S^2/(n - s)$, 
which is UMVU in the normal case, continue to minimize the variance? Under 
mild additional restrictions (for example, invariance under the transformations 
(4.10) or restriction to $Q$'s taking on only positive values), it turns out that this 
is true in some cases (for instance, in Example 4.15 below when the $n_i$ are equal) 
but not in others. For details, see Searle et al. (1992, Section 11.3). 

Example 4.15 Quadratic unbiased estimators. Let $X_{ij}$ ($j = 1, \ldots, n_i$; 
$i = 1, \ldots, s$) be independently distributed with means $E(X_{ij}) = \xi_i$ and common variance 
and fourth moment 

$\sigma^2 = E(X_{ij} - \xi_i)^2$ and $\beta = E(X_{ij} - \xi_i)^4/\sigma^4$, 

respectively. Consider estimators of $\sigma^2$ of the form $Q = \sum \lambda_i S_i^2$ where $S_i^2 = 
\sum_j (X_{ij} - X_{i\cdot})^2$ and $\sum \lambda_i (n_i - 1) = 1$ so that $Q$ is an unbiased estimator of $\sigma^2$. Then, 
the variance of $Q$ is minimized (Problem 4.19) when the $\lambda_i$'s are proportional to 
$1/(\alpha_i + 2)$ where $\alpha_i = [(n_i - 1)/n_i](\beta - 3)$. The standard choice of the $\lambda_i$ (which 
is to make them equal) is, therefore, best if either the $n_i$ are equal or $\beta = 3$, which 
is the case when the $X_{ij}$ are normal. 
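The weights described in Example 4.15 are easy to tabulate. The sketch below takes the stated result as given (weights proportional to $1/(\alpha_i + 2)$ with $\alpha_i = [(n_i - 1)/n_i](\beta - 3)$) and rescales them so that $\sum \lambda_i (n_i - 1) = 1$. The function name `optimal_lambdas`, the sample sizes, and the value $\beta = 9$ are illustrative assumptions.

```python
def optimal_lambdas(sizes, beta):
    """Weights lambda_i proportional to 1 / (alpha_i + 2), scaled so that
    sum(lambda_i * (n_i - 1)) = 1, where alpha_i = ((n_i - 1) / n_i) * (beta - 3).

    sizes[i] is n_i (each at least 2); beta is the standardized fourth moment.
    """
    raw = [1.0 / (((n - 1) / n) * (beta - 3.0) + 2.0) for n in sizes]
    scale = sum(r * (n - 1) for r, n in zip(raw, sizes))
    return [r / scale for r in raw]

sizes = [2, 5, 10]                                  # hypothetical unequal n_i
normal_weights = optimal_lambdas(sizes, 3.0)        # beta = 3: the normal case
heavy_weights = optimal_lambdas(sizes, 9.0)         # beta = 9: heavier tails
```

With $\beta = 3$ all weights equal $1/\sum (n_i - 1)$, which is the standard pooled estimator; with $\beta > 3$ the larger samples receive smaller weights per unit of their sum of squares.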

(ii) Let us now return to the model obtained from (4.4) by dropping the assumption 
of normality but without restricting attention to linear estimators. More 
specifically, we shall assume that $X_1, \ldots, X_n$ are random variables such that 

(4.36) the variables $X_i - \xi_i$ are iid with a common distribution $F$ 
which has expectation zero and an otherwise unknown 
probability density $f$, 

and such that (4.13) holds with $A$ an $s \times n$ matrix of rank $s$. 

In Section 2.4, we found that for the case $\xi_i = \theta$, the LSE $\bar X$ of $\theta$ is UMVU in this 
nonparametric model. To show that the corresponding result does not generally 
hold when $\xi$ is given by (4.13), consider the two-way layout of Example 4.11 and 
the estimation of 

(4.37) $\alpha_i = \xi_{i\cdot} - \xi_{\cdot\cdot} = \frac{1}{ab} \sum_{j=1}^{a} \sum_{k=1}^{b} (\xi_{ik} - \xi_{jk})$. 

To avoid calculations, suppose that $F$ is $t_2$, the $t$-distribution with 2 degrees of 
freedom. Then, the least squares estimators have infinite variance. On the other 
hand, let $\tilde X_{ij}$ be the median of the observations $X_{ijv}$, $v = 1, \ldots, m$. Then $\tilde X_{ik} - \tilde X_{jk}$ 
is an unbiased estimator of $\xi_{ik} - \xi_{jk}$ so that $\delta = (1/ab) \sum\sum (\tilde X_{ik} - \tilde X_{jk})$ is an 
unbiased estimator of $\alpha_i$. Furthermore, if $m \ge 3$, the $\tilde X_{ij}$ have finite variance and 
so, therefore, does $\delta$. (A sum of random variables with finite variance has finite 
variance.) This shows that the least squares estimators of the $\alpha_i$ are not UMVU 
when $F$ is unknown. The same argument applies to the $\beta$'s and $\gamma$'s. 

The situation is quite different for the estimation of $\mu$. Let $U$ be the class of 
unbiased estimators of $\mu$ in model (4.27) with $F$ unknown, and let $U'$ be the 
corresponding class of unbiased estimators when the $\alpha$'s, $\beta$'s, and $\gamma$'s are all zero. 
Then, clearly, $U \subset U'$; furthermore, it follows from Section 2.4 that $X_{\cdot\cdot\cdot}$ uniformly 
minimizes the variance within $U'$. Since $X_{\cdot\cdot\cdot}$ is a member of $U$, it uniformly 
minimizes the variance within $U$ and, hence, is UMVU for $\mu$ in model (4.27) when $F$ 
is unknown. 

For a more detailed discussion of this problem, see Anderson (1962). 

(iii) Instead of assuming the density $f$ in (4.36) to be unknown, we may be 
interested in the case in which $f$ is known but not normal. The model then remains 
invariant under the transformations 

(4.38) $X_v' = X_v + \sum_{j=1}^{s} a_{jv} \gamma_j$, $-\infty < \gamma_1, \ldots, \gamma_s < \infty$. 

Since $E(X_v') = \sum a_{jv} (\theta_j + \gamma_j)$, the induced transformations in the parameter space 
are given by 

(4.39) $\theta_j' = \theta_j + \gamma_j$ ($j = 1, \ldots, s$). 

The problem of estimating $\theta_j$ remains invariant under the transformations (4.38), 
(4.39), and 

(4.40) $d' = d + \gamma_j$ 

for any loss function of the form $\rho(d - \theta_j)$, and an estimator $\delta$ of $\theta_j$ is equivariant 
with respect to these transformations if it satisfies 

(4.41) $\delta(X') = \delta(X) + \gamma_j$. 

Since (4.39) is transitive over $\Omega$, the risk of any equivariant estimator is constant, 
and an MRE estimator of $\theta_j$ can be found by generalizing Theorems 1.8 and 1.10 
to the present situation (see Verhagen 1961). 

(iv) Important extensions to random and mixed effects models, and to general 
exponential families, will be taken up in the next two sections. 

5 Random and Mixed Effects Models 

In many applications of linear models, the effects of the various factors $A$, $B$, $C$, ... 
which were considered to be unknown constants in Section 3.4 are, instead, random. 
One then speaks of a random effects model (or Model II); in contrast, the 
corresponding model of Section 3.4 is a fixed effects model (or Model I). If both 
fixed and random effects occur, the model is said to be mixed. 

Example 5.1 Random effects one-way layout. Suppose that, as a measure of 
quality control, an auto manufacturer tests a sample of new cars, observing for 
each car, the mileage achieved on a number of occasions on a gallon of gas. 
Suppose $X_{ij}$ is the mileage of the $i$th car on the $j$th occasion, at time $t_{ij}$, with all 
the $t_{ij}$ being selected at random and independently of each other. This would have 
been modeled in Example 4.1 as 

$X_{ij} = \mu + \alpha_i + U_{ij}$ 

where the $U_{ij}$ are independent $N(0, \sigma^2)$. Such a model would be appropriate if 
these particular cars were the object of study and a replication of the experiment 
thus consisted of a number of test runs by the same cars. However, the manufacturer 
is interested in the performance of the thousands of cars to be produced that year 
and, for this reason, has drawn a random sample of cars for the test. A replication 
of the experiment would start by drawing a new sample. The effect of the $i$th car 
is therefore a random variable, and the model becomes 

(5.1) $X_{ij} = \mu + A_i + U_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, s$). 

Here and following, the populations being sampled are assumed to be large enough 
so that independence and normality of the unobservable random variables $A_i$ and 
$U_{ij}$ can be assumed as a reasonable approximation. Without loss of generality, one 
can put $E(A_i) = E(U_{ij}) = 0$ since the means can be absorbed into $\mu$. The variances 
will be denoted by $\mathrm{var}(A_i) = \sigma_A^2$ and $\mathrm{var}(U_{ij}) = \sigma^2$. 

The $X_{ij}$ are dependent, and their joint distribution, and hence the estimation 
of $\sigma_A^2$ and $\sigma^2$, is greatly simplified if the model is assumed to be balanced, that 
is, to satisfy $n_i = n$ for all $i$. In that case, in analogy with the transformation 
(4.5), let each set $(X_{i1}, \ldots, X_{in})$ be subjected to an orthogonal transformation to 
$(Y_{i1}, \ldots, Y_{in})$ such that $Y_{i1} = \sqrt{n}\, X_{i\cdot}$. An additional orthogonal transformation 
is made from $(Y_{11}, \ldots, Y_{s1})$ to $(Z_{11}, \ldots, Z_{s1})$ such that $Z_{11} = \sqrt{s}\, Y_{\cdot 1}$, whereas 
for $j > 1$, we put $Z_{ij} = Y_{ij}$. Unlike the $X_{ij}$, the $Y_{ij}$ and $Z_{ij}$ are all independent 
(Problem 5.1). They are normal with means 

$E(Z_{11}) = \sqrt{sn}\, \mu$, $E(Z_{ij}) = 0$ if $i > 1$ or $j > 1$ 

and variances 

$\mathrm{var}(Z_{i1}) = \sigma^2 + n\sigma_A^2$, $\mathrm{var}(Z_{ij}) = \sigma^2$ for $j > 1$, 

so that the joint density of the $Z$'s is proportional to 

(5.2) $\exp\left\{ -\frac{1}{2(\sigma^2 + n\sigma_A^2)} \left[ (z_{11} - \sqrt{sn}\, \mu)^2 + S_A^2 \right] - \frac{1}{2\sigma^2} S^2 \right\}$ 

with 

$S_A^2 = \sum_{i=2}^{s} Z_{i1}^2 = n \sum_{i=1}^{s} (X_{i\cdot} - X_{\cdot\cdot})^2$, $S^2 = \sum_{i=1}^{s} \sum_{j=2}^{n} Z_{ij}^2 = \sum_{i=1}^{s} \sum_{j=1}^{n} (X_{ij} - X_{i\cdot})^2$. 

This is a three-parameter exponential family with 

(5.3) $\eta_1 = \frac{\mu}{\sigma^2 + n\sigma_A^2}$, $\eta_2 = \frac{1}{\sigma^2 + n\sigma_A^2}$, $\eta_3 = \frac{1}{\sigma^2}$. 

The variance of $X_{ij}$ is $\mathrm{var}(X_{ij}) = \sigma^2 + \sigma_A^2$, and we are interested in estimating the 
variance components $\sigma_A^2$ and $\sigma^2$. Since 

$E\left(\frac{S_A^2}{s - 1}\right) = \sigma^2 + n\sigma_A^2$ and $E\left(\frac{S^2}{s(n - 1)}\right) = \sigma^2$, 

it follows that 

(5.4) $\hat\sigma^2 = \frac{S^2}{s(n - 1)}$ and $\hat\sigma_A^2 = \frac{1}{n} \left[ \frac{S_A^2}{s - 1} - \frac{S^2}{s(n - 1)} \right]$ 

are UMVU estimators of $\sigma^2$ and $\sigma_A^2$, respectively. The UMVU estimator of the 
ratio $\sigma_A^2/\sigma^2$ is 

$\frac{1}{n} \left[ \frac{K_{f,-2}}{s(n - 1)} \cdot \frac{\hat\sigma^2 + n\hat\sigma_A^2}{\hat\sigma^2} - 1 \right]$ 

where $K_{f,-2}$ is given by (2.2.5) with $f = s(n - 1)$ (Problem 5.3). Typically, the only 
linear subspace of the $\eta$'s of interest here is the trivial one defined by $\sigma_A^2 = 0$, which 
corresponds to $\eta_2 = \eta_3$ and to the case in which the $sn$ $X_{ij}$ are iid as $N(\mu, \sigma^2)$. || 
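The estimators of the two variance components in the balanced one-way random effects model can be computed from the between-unit and within-unit sums of squares. The sketch below assumes the formulas $\hat\sigma^2 = S^2/[s(n-1)]$ and $\hat\sigma_A^2 = (1/n)[S_A^2/(s-1) - \hat\sigma^2]$; the function name `variance_components` and both data sets are made up. The second data set is chosen so that the estimate of $\sigma_A^2$ comes out negative.

```python
def variance_components(x):
    """Unbiased estimates (sigma2_hat, sigma2_A_hat) in the balanced one-way
    random effects model.

    x[i] is the list of the n observations X_i1, ..., X_in on unit i.
    """
    s, n = len(x), len(x[0])
    unit_means = [sum(row) / n for row in x]                      # X_i.
    grand = sum(unit_means) / s                                   # X_..
    ss_a = n * sum((m - grand) ** 2 for m in unit_means)          # S_A^2
    ss = sum((obs - m) ** 2
             for row, m in zip(x, unit_means) for obs in row)     # S^2
    sigma2_hat = ss / (s * (n - 1))
    sigma2_a_hat = (ss_a / (s - 1) - sigma2_hat) / n
    return sigma2_hat, sigma2_a_hat

# Units that differ a lot: the between-unit component dominates.
spread = variance_components([[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]])

# Units with identical means but noisy replicates: the estimate of sigma_A^2 is negative.
flat = variance_components([[0.0, 10.0], [1.0, 9.0], [2.0, 8.0]])
```

The second case shows that an unbiased estimator of a nonnegative variance component need not itself be nonnegative.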

Example 5.2 Random effects two-way layout. In analogy to Example 4.11, consider 
the random effects two-way layout. 

(5.5) $X_{ijk} = \mu + A_i + B_j + C_{ij} + U_{ijk}$ 

where the unobservable random variables $A_i$, $B_j$, $C_{ij}$, and $U_{ijk}$ are independently 
normally distributed with zero mean and with variances $\sigma_A^2$, $\sigma_B^2$, $\sigma_C^2$, and $\sigma^2$, 
respectively. We shall restrict attention to the balanced case $i = 1, \ldots, I$, $j = 1, \ldots, J$, 
and $k = 1, \ldots, n$. As in the preceding example, a linear transformation leads to 
independent normal variables $Z_{ijk}$ with means $E(Z_{111}) = \sqrt{IJn}\, \mu$ and 0 for all 
other $Z$'s and with variances 

(5.6) 
$\mathrm{var}(Z_{111}) = nJ\sigma_A^2 + nI\sigma_B^2 + n\sigma_C^2 + \sigma^2$, 
$\mathrm{var}(Z_{i11}) = nJ\sigma_A^2 + n\sigma_C^2 + \sigma^2$, $i > 1$, 
$\mathrm{var}(Z_{1j1}) = nI\sigma_B^2 + n\sigma_C^2 + \sigma^2$, $j > 1$, 
$\mathrm{var}(Z_{ij1}) = n\sigma_C^2 + \sigma^2$, $i, j > 1$, 
$\mathrm{var}(Z_{ijk}) = \sigma^2$, $k > 1$. 

As an example in which such a model might arise, consider a reliability study of 
blood counts, in which blood samples from each of $J$ patients are divided into $nI$ 
subsamples of which $n$ are sent to each of $I$ laboratories. The study is not concerned 
with these particular patients and laboratories, which, instead, are assumed to be 
random samples from suitable patient and laboratory populations. From (5.5) it 
follows that $\mathrm{var}(X_{ijk}) = \sigma_A^2 + \sigma_B^2 + \sigma_C^2 + \sigma^2$. The terms on the right are the variance 
components due to laboratories, patients, the interaction between the two, and the 
subsamples from a patient. 

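The joint distribution of the $Z_{ijk}$ constitutes a five-parameter exponential family 
with the complete set of sufficient statistics (Problem 5.9) 

(5.7) 
$S_A^2 = \sum_{i=2}^{I} Z_{i11}^2 = nJ \sum_{i=1}^{I} (X_{i\cdot\cdot} - X_{\cdot\cdot\cdot})^2$, 
$S_B^2 = \sum_{j=2}^{J} Z_{1j1}^2 = nI \sum_{j=1}^{J} (X_{\cdot j\cdot} - X_{\cdot\cdot\cdot})^2$, 
$S_C^2 = \sum_{i=2}^{I} \sum_{j=2}^{J} Z_{ij1}^2 = n \sum_{i=1}^{I} \sum_{j=1}^{J} (X_{ij\cdot} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2$, 
$S^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=2}^{n} Z_{ijk}^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{n} (X_{ijk} - X_{ij\cdot})^2$, 
$Z_{111} = \sqrt{IJn}\, X_{\cdot\cdot\cdot}$. 

From the expectations of these statistics, one finds the UMVU estimators of the 
variance components $\sigma^2$, $\sigma_C^2$, $\sigma_B^2$, and $\sigma_A^2$ to be 

$\hat\sigma^2 = \frac{S^2}{IJ(n - 1)}$, $\hat\sigma_C^2 = \frac{1}{n} \left[ \frac{S_C^2}{(I - 1)(J - 1)} - \hat\sigma^2 \right]$, 

$\hat\sigma_B^2 = \frac{1}{nI} \left[ \frac{S_B^2}{J - 1} - n\hat\sigma_C^2 - \hat\sigma^2 \right]$, $\hat\sigma_A^2 = \frac{1}{nJ} \left[ \frac{S_A^2}{I - 1} - n\hat\sigma_C^2 - \hat\sigma^2 \right]$. 
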

A submodel of (5.5), which is sometimes appropriate, is the additive model 
corresponding to the absence of the interaction terms $C_{ij}$ and hence to the 
assumption $\sigma_C^2 = 0$. If $\eta_1 = \mu/\mathrm{var}(Z_{111})$, $1/\eta_2 = nJ\sigma_A^2 + n\sigma_C^2 + \sigma^2$, $1/\eta_3 = 
nI\sigma_B^2 + n\sigma_C^2 + \sigma^2$, $1/\eta_4 = n\sigma_C^2 + \sigma^2$, and $1/\eta_5 = \sigma^2$, this assumption is 
equivalent to $\eta_4 = \eta_5$ and thus restricts the $\eta$'s to a linear subspace. The submodel 
constitutes a four-parameter exponential family, with the complete set of sufficient 
statistics $Z_{111}$, $S_A^2$, $S_B^2$, and $S'^2 = S_C^2 + S^2 = \sum\sum\sum (X_{ijk} - X_{i\cdot\cdot} - X_{\cdot j\cdot} + X_{\cdot\cdot\cdot})^2$. The UMVU 
estimators of the variance components $\sigma_A^2$, $\sigma_B^2$, and $\sigma^2$ are now easily obtained as 
before (Problem 5.10). 
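The sums of squares of the crossed design, and the variance component estimates obtained by equating mean squares to their expectations, can be sketched as follows. The code assumes the balanced layout $x[i][j][k]$ and the expectations $E[S_A^2/(I-1)] = nJ\sigma_A^2 + n\sigma_C^2 + \sigma^2$, $E[S_B^2/(J-1)] = nI\sigma_B^2 + n\sigma_C^2 + \sigma^2$, $E[S_C^2/((I-1)(J-1))] = n\sigma_C^2 + \sigma^2$, and $E[S^2/(IJ(n-1))] = \sigma^2$. It also evaluates the pooled sum of squares of the additive submodel, so that its decomposition into the interaction and within-cell parts can be checked. Function names and data are illustrative.

```python
def crossed_sums_of_squares(x):
    """Sums of squares for the balanced crossed design x[i][j][k]."""
    I, J, n = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / n for j in range(J)] for i in range(I)]     # X_ij.
    row = [sum(cell[i]) / J for i in range(I)]                          # X_i..
    col = [sum(cell[i][j] for i in range(I)) / I for j in range(J)]     # X_.j.
    grand = sum(row) / I                                                # X_...
    ss_a = n * J * sum((r - grand) ** 2 for r in row)
    ss_b = n * I * sum((c - grand) ** 2 for c in col)
    ss_c = n * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                   for i in range(I) for j in range(J))
    ss = sum((obs - cell[i][j]) ** 2
             for i in range(I) for j in range(J) for obs in x[i][j])
    # Pooled sum of squares used when the interaction is assumed absent.
    ss_additive = sum((obs - row[i] - col[j] + grand) ** 2
                      for i in range(I) for j in range(J) for obs in x[i][j])
    return ss_a, ss_b, ss_c, ss, ss_additive

def crossed_variance_components(x):
    """Unbiased estimates of sigma^2, sigma_C^2, sigma_B^2, sigma_A^2 (in that order)."""
    I, J, n = len(x), len(x[0]), len(x[0][0])
    ss_a, ss_b, ss_c, ss, _ = crossed_sums_of_squares(x)
    sigma2 = ss / (I * J * (n - 1))
    sigma2_c = (ss_c / ((I - 1) * (J - 1)) - sigma2) / n
    sigma2_b = (ss_b / (J - 1) - n * sigma2_c - sigma2) / (n * I)
    sigma2_a = (ss_a / (I - 1) - n * sigma2_c - sigma2) / (n * J)
    return sigma2, sigma2_c, sigma2_b, sigma2_a

# Made-up data with I = 2, J = 3, n = 2.
x = [[[1.0, 3.0], [4.0, 6.0], [7.0, 9.0]],
     [[2.0, 2.0], [8.0, 10.0], [3.0, 5.0]]]
ss_a, ss_b, ss_c, ss, ss_additive = crossed_sums_of_squares(x)
sigma2, sigma2_c, sigma2_b, sigma2_a = crossed_variance_components(x)
```

In this data set the two row means coincide, so $S_A^2 = 0$ and the estimate of $\sigma_A^2$ is negative, while the pooled sum of squares equals $S_C^2 + S^2$ exactly.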

Another submodel of (5.5) which is of interest is obtained by setting $\sigma_B^2 = 0$, 
thus eliminating the $B_j$ terms from (5.5). However, this model, which corresponds 
to the linear subspace $\eta_3 = \eta_4$, does not arise naturally in the situations leading to 
(5.5), as illustrated by the laboratory example. These situations are characterized 
by a crossed design in which each of the $I$ $A$ units (laboratories) is observed in 
combination with each of the $J$ $B$ units (patients). On the other hand, the model 
without the $B$ terms arises naturally in the very commonly occurring nested design 
illustrated in the following example. || 

Example 5.3 Two nested random factors. For the two factors A and B , suppose 
that each of the units corresponding to different values of i (i.e., different levels 
of A) is itself a collection of smaller units from which the values of B are drawn. 
Thus, the A units might be hospitals, schools, or farms that constitute a random 
sample from a population of such units from each of which a random sample of 
patients, students, or trees is drawn. On each of the latter, a number of observations 
is taken (for example, a number of blood counts, grades, or weights of a sample 
of apples). The resulting model [with a slight change of notation from (5.5)] may 
be written as 

(5.8) $X_{ijk} = \mu + A_i + B_{ij} + U_{ijk}.$

Here, the $A$’s, $B$’s, and $U$’s are again assumed to be independent normal with
zero means and variances $\sigma_A^2$, $\sigma_B^2$, and $\sigma^2$, respectively. In the balanced case
($i = 1, \ldots, I$; $j = 1, \ldots, J$; $k = 1, \ldots, n$), a linear transformation produces
independent variables with means $E(Z_{111}) = \sqrt{IJn}\,\mu$ and $= 0$ for all other $Z$’s
and variances

$\mathrm{var}(Z_{i11}) = \sigma^2 + n\sigma_B^2 + Jn\sigma_A^2 \quad (i = 1, \ldots, I),$
$\mathrm{var}(Z_{ij1}) = \sigma^2 + n\sigma_B^2 \quad (j > 1),$
$\mathrm{var}(Z_{ijk}) = \sigma^2 \quad (k > 1).$

The joint distribution of the $Z$’s constitutes a four-parameter exponential family
with the complete set of sufficient statistics

$S_A^2 = \sum_{i=2}^{I} Z_{i11}^2 = Jn \sum_{i=1}^{I} (X_{i\cdot\cdot} - X_{\cdots})^2,$

(5.9) $S_B^2 = \sum_{i=1}^{I}\sum_{j=2}^{J} Z_{ij1}^2 = n \sum_{i=1}^{I}\sum_{j=1}^{J} (X_{ij\cdot} - X_{i\cdot\cdot})^2,$

$S^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=2}^{n} Z_{ijk}^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{n} (X_{ijk} - X_{ij\cdot})^2,$

$Z_{111} = \sqrt{IJn}\, X_{\cdots},$

and the UMVU estimators of the variance components can be obtained as before 
(Problem 5.12). || 
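As a sketch of what "as before" amounts to here (the text leaves the details to Problem 5.12, so the display below is a worked derivation under model (5.8), not a quotation): each of $S_A^2$, $S_B^2$, and $S^2$ is a sum of squares of independent mean-zero $Z$’s with a common variance, so its expectation is the number of terms times that variance. Solving the resulting linear equations gives unbiased estimators that are functions of the complete sufficient statistics.

```latex
\begin{aligned}
E(S_A^2) &= (I-1)\,(\sigma^2 + n\sigma_B^2 + Jn\sigma_A^2), \\
E(S_B^2) &= I(J-1)\,(\sigma^2 + n\sigma_B^2), \\
E(S^2)   &= IJ(n-1)\,\sigma^2, \\[4pt]
\hat\sigma^2   &= \frac{S^2}{IJ(n-1)}, \qquad
\hat\sigma_B^2 = \frac{1}{n}\left[\frac{S_B^2}{I(J-1)} - \hat\sigma^2\right], \qquad
\hat\sigma_A^2 = \frac{1}{Jn}\left[\frac{S_A^2}{I-1} - \frac{S_B^2}{I(J-1)}\right].
\end{aligned}
```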

The models illustrated in Examples 5.2 and 5.3 extend in a natural way to more 
than two factors, and in the balanced cases, the UMVU estimators of the variance 
components are easily derived. 

The estimation of variance components described above suffers from two serious 
difficulties. 

(i) The UMVU estimators of all the variance components except $\sigma^2$ can take
on negative values with probabilities as high as .5 and even in excess of that value
(Problems 5.5-5.7) (and, correspondingly, their expected squared errors are quite
unsatisfactory; see Klotz, Milton, and Zacks 1969).

The interpretation of such negative values either as indications that the associated 
components are negligible (which is sometimes formalized by estimating them to 
be zero) or that the model is incorrect is not always convincing because negative 
values do occur even when the model is correct and the components are positive. 
An alternative possibility, here and throughout this section, is to fall back on
maximum likelihood estimation or restricted MLE’s (REML estimates) obtained by
maximizing the likelihood after first reducing the data through location invariance
(Thompson, 1962; Corbeil and Searle, 1976). Although these methods have no
small-sample justification, they are equivalent to a noninformative prior Bayesian
solution (Searle et al. 1992; see also Example 2.7). Alternatively, there is an
approach due to Hartung (1981), who minimizes bias subject to non-negativity, or
Pukelsheim (1981) and Mathew (1984), who find non-negative unbiased estimates
of variance.

(ii) Models as simple as those obtained in Examples 5.1-5.3 are not available 
when the layout is not balanced. 

The joint density of the $X$’s can then be obtained by noting that they are linear
functions of normal variables and thus have a joint multivariate normal distribution.
To obtain it, one need only write down the covariance matrix of the $X$’s and
invert it. The result is an exponential family which typically is not complete
unless the model is balanced. (This is illustrated for the one-way layout in Problem
5.4.) UMVU estimators cannot be expected in this case (see Pukelsheim 1981). A
characterization of U-estimable functions permitting UMVU estimators is given
by Unni (1978). Two general methods for the estimation of variance components
have been developed in some detail; these are maximum and restricted maximum
likelihood, and the minimum norm quadratic unbiased estimation (Minque)
introduced by Rao (1970). Surveys of the area are given by Searle (1971b), Harville
(1977), and Kleffe (1977). More detailed introductions can be found, for example,
in the books by Rao and Kleffe (1988), Searle et al. (1992), and Burdick and
Graybill (1992).
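Difficulty (i) can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not a procedure from the text: it uses the balanced one-way random effects layout $X_{ij} = \mu + A_i + U_{ij}$ ($i = 1, \ldots, s$; $j = 1, \ldots, n$) with $\mu = 0$, and the usual unbiased estimator $\hat\sigma_A^2 = \frac{1}{n}\left[\frac{S_A^2}{s-1} - \frac{S^2}{s(n-1)}\right]$; the function name, the default number of replications, and the seed are arbitrary.

```python
import random

def negative_fraction(s, n, var_a, var_e, reps=2000, seed=0):
    """Fraction of simulated data sets in which the estimate of var_a is negative."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(reps):
        groups = []
        for _ in range(s):
            a = rng.gauss(0.0, var_a ** 0.5)            # random effect A_i
            groups.append([a + rng.gauss(0.0, var_e ** 0.5) for _ in range(n)])
        means = [sum(g) / n for g in groups]
        grand = sum(means) / s
        s_a = n * sum((m - grand) ** 2 for m in means)   # between-group sum of squares
        s_e = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
        estimate = (s_a / (s - 1) - s_e / (s * (n - 1))) / n
        negatives += estimate < 0
    return negatives / reps
```

With $s = 3$, $n = 4$, and $\sigma_A^2 = 0$, the estimate is negative exactly when an $F$ ratio with 2 and 9 degrees of freedom falls below 1, which happens with probability close to 0.6; the simulated fraction drops as $\sigma_A^2$ grows but does not vanish.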

So far, the models we have considered have had factors that were either all fixed 
or all random. We now look at an example of a mixed model, which contains both 
types of factors. 

Example 5.4 Mixed effects model. In Example 5.3, it was assumed that the
hospitals, schools, or farms were obtained as a random sample from a population of
such units. Let us now suppose that it is only these particular hospitals that are
of interest (perhaps it is the set of all hospitals in the city), whereas the patients
continue to be drawn at random from these hospitals. Instead of (5.8), we shall
assume that the observations are given by


(5.10) $X_{ijk} = \mu + \alpha_i + B_{ij} + U_{ijk} \quad \left(\sum \alpha_i = 0\right).$


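A transformation very similar to the earlier one (Problem 5.14) now leads to
independent normal variables $W_{ijk}$ with joint density proportional to

(5.11) $\exp\left\{ -\frac{1}{2(\sigma^2 + n\sigma_B^2)} \left[ \sum_{i=1}^{I} \left(w_{i11} - \sqrt{Jn}\,(\mu + \alpha_i)\right)^2 + S_B^2 \right] - \frac{1}{2\sigma^2}\, S^2 \right\}$

with $S_B^2$ and $S^2$ given by (5.9), and with $W_{i11} = \sqrt{Jn}\, X_{i\cdot\cdot}$. This is an exponential
family with the complete set of sufficient statistics $X_{i\cdot\cdot}$ ($i = 1, \ldots, I$), $S_B^2$, and $S^2$.
The UMVU estimators of $\sigma_B^2$ and $\sigma^2$ are the same as in Example 5.3, whereas the UMVU
estimator of $\alpha_i$ is $X_{i\cdot\cdot} - X_{\cdots}$, as it would be if the $B$’s were fixed. ||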


Thus far in this section, our focus has been the estimation of the variance
components in random and mixed effects models. There is, however, another important
estimation target in these models, the random effects themselves. This presents
a somewhat different problem than is considered in the rest of this book, as the
estimand is now a random variable rather than a fixed parameter. However, the
theory of UMVU estimation has a fairly straightforward extension to the present
case. We illustrate this in the following example.

Example 5.5 Best prediction of random effects. Consider, once more, the
random effects model (5.1), where the value $\alpha_i$ of $A_i$, the effect on gas mileage, could
itself be of interest.

Since $\alpha_i$ is the realized value of a random variable rather than a fixed parameter,
it is common to speak of prediction of $\alpha_i$ rather than estimation of $\alpha_i$. To avoid
identifiability problems, we will, in fact, predict $\mu + \alpha_i$ rather than $\alpha_i$. If $\delta(X)$ is a
predictor, then under squared error loss we have

$E[\delta(X) - (\mu + \alpha_i)]^2 = E[\delta(X) \pm E(\mu + \alpha_i \mid X) - (\mu + \alpha_i)]^2$

(5.12) $\qquad = E[\delta(X) - E(\mu + \alpha_i \mid X)]^2 + E[E(\mu + \alpha_i \mid X) - (\mu + \alpha_i)]^2.$

As we have no control over the second term on the right side of (5.12), we only 
need be concerned with minimization of the first term. (In this sense, prediction 
of a random variable is the same as estimation of its conditional expected value.) 
Under the normality assumptions of Example 5.1, 


(5.13) $E(\mu + \alpha_i \mid X) = \frac{n\sigma_A^2}{n\sigma_A^2 + \sigma^2}\, \bar X_i + \frac{\sigma^2}{n\sigma_A^2 + \sigma^2}\, \mu.$


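Assuming the variances known, we set

$\delta(X) = \frac{n\sigma_A^2}{n\sigma_A^2 + \sigma^2}\, \bar X_i + \frac{\sigma^2}{n\sigma_A^2 + \sigma^2}\, \delta'(X)$

and choose $\delta'(X)$ to minimize $E[\delta'(X) - \mu]^2$. The UMVU estimator of $\mu$ is $\bar X$,
and the UMVU predictor of $\mu + \alpha_i$ is

(5.14) $\frac{n\sigma_A^2}{n\sigma_A^2 + \sigma^2}\, \bar X_i + \frac{\sigma^2}{n\sigma_A^2 + \sigma^2}\, \bar X.$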

As we will see in Chapter 4, this predictor is also a Bayes estimator in a hierarchical 
model (which is another way of thinking of the model (5.1); see Searle et al. 1992, 
Chapter 9, and Problem 4.7.15). 

Although we have assumed normality, optimality of (5.14) continues if the
distributional assumptions are relaxed, similar to (4.35). Under such relaxed
assumptions, (5.14) continues to be best among linear unbiased predictors (Problem
5.17). Harville (1976) has formulated and proved a Gauss-Markov-type theorem
for a general mixed model. ||
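The predictor of Example 5.5 is a weighted average of the group mean $\bar X_i$ and the overall mean $\bar X$, with weight $n\sigma_A^2/(n\sigma_A^2 + \sigma^2)$ on the group mean. A minimal sketch, assuming (as in the example) that the variance components are known; the function and argument names are illustrative:

```python
def best_predictor(group_mean, grand_mean, n, var_a, var_e):
    """Predictor of mu + alpha_i: shrink the group mean toward the grand mean.

    n is the number of observations per group; var_a and var_e are the
    (assumed known) random-effect and error variances.
    """
    weight = n * var_a / (n * var_a + var_e)   # weight on the group's own mean
    return weight * group_mean + (1.0 - weight) * grand_mean
```

When `var_a` is zero the predictor ignores the group mean entirely, and as `var_a` grows relative to `var_e / n` it approaches the group mean.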


6 Exponential Linear Models 

The great success of the linear models described in the previous sections suggests
the desirability of extending these models beyond the normal case. A natural
generalization combines a general exponential family with the structure of a linear
model and will often result in exponential linear models in terms of new
parameters [see, for example, (5.2) and (5.3)]. However, the models in this section are
discrete and do not arise from normal theory.

Equivariance tends to play little role in the resulting models; they are therefore 
somewhat out of place in this chapter. But certain analogies with normal linear 
models make it convenient to present them here. 

(i) Contingency Tables 

Suppose that the underlying exponential family is the set of multinomial
distributions (1.5.4), which may be written as

(6.1) $\exp\left(\sum_{i=0}^{s} x_i \log p_i\right) h(x),$

and that a linear structure is imposed on the parameters $\eta_i = \log p_i$. Expositions
of the resulting theory of log linear models can be found in the books by Agresti
(1990), Christensen (1990), Santner and Duffy (1990), and Everitt (1992). Diaconis
(1988, Chapter 9) shows how a combination of exponential family theory and group
representations leads naturally to log linear models.

The models have close formal similarities with the corresponding normal models,
and a natural linear subspace of the $\log p_i$ often corresponds to a natural
restriction on the $p$’s. In particular, since sums of the $\log p$’s correspond to
products of the $p$’s, a subspace defined by setting suitable interaction terms equal to zero
often is equivalent to certain independence properties in the multinomial model.

The exponential family (6.1) is not of full rank since the $p$’s must add up to 1.
A full-rank form is

(6.2) $\exp\left[\sum_{i=1}^{s} x_i \log(p_i/p_0) + n \log p_0\right] h(x).$

If we let

(6.3) $\eta_i' = \log\frac{p_i}{p_0} = \eta_i - \eta_0,$

we see that arbitrary linear functions of the $\eta_i'$ correspond to arbitrary contrasts
(i.e., functions of the differences) of the $\eta_i$. From Example 2.3.8, it follows that
$(X_1, \ldots, X_s)$ or $(X_0, X_1, \ldots, X_s)$ is sufficient and complete for (6.2) and hence
also for (6.1). In applications, we shall find (6.1) the more convenient form to use.

If the $\eta$’s are required to satisfy $r$ independent linear restrictions $\sum_j a_{ij}\eta_j = b_i$
($i = 1, \ldots, r$), the resulting distributions will form an exponential family of rank $s - r$,
and the associated minimal sufficient statistics $T$ will continue to be complete.
Since $E(X_i/n) = p_i$, the probabilities $p_i$ are always U-estimable; their UMVU
estimators can be obtained as the conditional expectations of $X_i/n$ given $T$. If
$\hat p_i$ is the UMVU estimator of $p_i$, a natural estimator of $\eta_i$ is $\hat\eta_i = \log \hat p_i$, but,
of course, this is no longer unbiased. In fact, no unbiased estimator of $\eta_i$ exists
because only polynomials of the $p_i$ can be U-estimable (Problem 2.3.25). When
$\hat p_i$ is also the MLE of $p_i$, $\hat\eta_i$ is the MLE of $\eta_i$. However, the MLE $\hat{\hat p}_i$ does not
always coincide with the UMVU estimator $\hat p_i$. An example of this possibility with
$\log p_i = \alpha + \beta t_i$ ($t$’s known; $\alpha$ and $\beta$ unknown) is given by Haberman (1974,
Example 1.16, p. 29; Example 3.3, p. 60). It is a disadvantage of the $\hat p_i$ in this case
that, unlike the $\hat{\hat p}_i$, they do not always satisfy the restrictions of the model, that is, for
some values of the $X$’s, no $\alpha$ and $\beta$ exist for which $\log \hat p_i = \alpha + \beta t_i$. Typically, if
$\hat p_i \ne \hat{\hat p}_i$, the difference between the two is moderate.

For estimating the $\eta_i$, Goodman (1970) has recommended in some cases applying
the estimators not to the cell frequencies $X_i/n$ but to $X_i/n + 1/2$, in order to
decrease the bias of the MLE. This procedure also avoids difficulties that may arise
when some of the cell counts are zero. (See also Bishop, Fienberg, and Holland
1975, Chapter 12.)


Example 6.1 Two-way contingency table. Consider the situation of Example
2.3.9 in which $n$ subjects are classified according to two characteristics $A$ and
$B$ with possible outcomes $A_1, \ldots, A_I$ and $B_1, \ldots, B_J$. If $n_{ij}$ is the number of
subjects with properties $A_i$ and $B_j$, the joint distribution of the $n_{ij}$ can be written
as

$\frac{n!}{\prod_{i,j} n_{ij}!} \exp\left(\sum\sum n_{ij}\,\xi_{ij}\right), \qquad \xi_{ij} = \log p_{ij}.$

Write $\xi_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$ as in Example 4.11, with the side conditions (4.28).
This implies no restrictions since any $IJ$ numbers $\xi_{ij}$ can be represented in this
form. The $p_{ij}$ must, of course, satisfy $\sum\sum p_{ij} = 1$ and the $\xi_{ij}$ must therefore satisfy
$\sum\sum \exp \xi_{ij} = 1$. This equation determines $\mu$ as a function of the $\alpha$’s, $\beta$’s, and $\gamma$’s
which are free, subject only to (4.28). The UMVU estimators of the $p_{ij}$ were seen
in Example 2.3.9 to be $n_{ij}/n$. ||


In Example 4.11 (normal two-way layout), it is sometimes reasonable to suppose
that all the $\gamma_{ij}$’s (the interactions) are zero. In the present situation, this corresponds
exactly to the assumption that the characteristics $A$ and $B$ are independent, that is,
that $p_{ij} = p_{i+}p_{+j}$ (Problem 6.1). The UMVU estimator of $p_{ij}$ is now $n_{i+}n_{+j}/n^2$.
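Under the independence model the cell estimates are products of the marginal proportions, $n_{i+}n_{+j}/n^2$. The following minimal sketch computes them exactly with rational arithmetic; the function name and the list-of-rows input format are illustrative assumptions.

```python
from fractions import Fraction

def independence_estimates(counts):
    """Estimates n_{i+} n_{+j} / n^2 of the cell probabilities p_ij.

    counts[i][j] is the observed count n_ij in an I x J table.
    """
    row_totals = [sum(row) for row in counts]          # n_{i+}
    col_totals = [sum(col) for col in zip(*counts)]    # n_{+j}
    n = sum(row_totals)
    return [[Fraction(r * c, n * n) for c in col_totals] for r in row_totals]
```

The estimates always sum to 1, and each row of the result is proportional to the column totals, so the fitted table itself satisfies the independence restriction.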

Example 6.2 Conditional independence in a three-way table. In Example 2.3.10,
it was assumed that the subjects are classified according to three characteristics $A$,
$B$, and $C$ and that conditionally, given outcome $C$, the two characteristics $A$ and
$B$ are independent. If $\xi_{ijk} = \log p_{ijk}$ and $\xi_{ijk}$ is written as

$\xi_{ijk} = \mu + \alpha_i^A + \alpha_j^B + \alpha_k^C + \alpha_{ij}^{AB} + \alpha_{ik}^{AC} + \alpha_{jk}^{BC} + \alpha_{ijk}^{ABC}$

with the $\alpha$’s subject to the usual restrictions and with $\mu$ determined by the fact
that the $p_{ijk}$ add up to 1, it turns out that the conditional independence of $A$ and
$B$ given $C$ is equivalent to the vanishing of both the three-way interactions $\alpha_{ijk}^{ABC}$
and the $A$, $B$ interactions $\alpha_{ij}^{AB}$ (Problem 6.2). The UMVU estimators of the $p_{ijk}$
in this model were obtained in Example 2.3.10. ||


(ii) Independent Binomial Experiments 


The submodels considered in Examples 5.2-5.4 and 6.1-6.2 corresponded to
natural assumptions about the variances or probabilities in question. However, in
general, the assumption of linearity in the $\eta$’s made at the beginning of this section
is rather arbitrary and is dictated by mathematical convenience rather than by
meaningful structural assumptions. We shall now consider a particularly simple
class of problems, in which this linearity assumption is inconsistent with more
customary assumptions. Agreement with these assumptions can be obtained by
not insisting on a linear structure for the parameters $\eta_i$ themselves but permitting
a linear structure for a suitable function of the $p$’s.

The problems are concerned with a number of independent random variables $X_i$
having the binomial distributions $b(p_i, n_i)$. Suppose the $X$’s have been obtained
from some unobservable variables $Z_i$ distributed independently as $N(\zeta_i, \sigma^2)$ by
setting

(6.4) $X_i = \begin{cases} 0 & \text{if } Z_i \le u \\ 1 & \text{if } Z_i > u. \end{cases}$

Then

(6.5) $p_i = P(Z_i > u) = \Phi\left(\frac{\zeta_i - u}{\sigma}\right)$

and hence

(6.6) $\zeta_i = u + \sigma\,\Phi^{-1}(p_i).$

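Now consider a two-way layout for the $Z$’s in which the effects are additive, as
in Example 4.11. The subspace of the $\zeta_{ij}$ ($i = 1, \ldots, a$; $j = 1, \ldots, b$) defining this
model is characterized by the fact that the interactions satisfy

(6.7) $\gamma_{ij} = \zeta_{ij} - \zeta_{i\cdot} - \zeta_{\cdot j} + \zeta_{\cdot\cdot} = 0$

which, by (6.6), implies that

(6.8) $\Phi^{-1}(p_{ij}) - \frac{1}{b}\sum_{j} \Phi^{-1}(p_{ij}) - \frac{1}{a}\sum_{i} \Phi^{-1}(p_{ij}) + \frac{1}{ab}\sum_{i}\sum_{j} \Phi^{-1}(p_{ij}) = 0.$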

» j 


The “natural” linear subspace of the parameter space for the $Z$’s thus translates
into a linear subspace in terms of the parameters $\Phi^{-1}(p_{ij})$ for the $X$’s, and the
corresponding fact by (6.6) is true quite generally for subspaces defined in terms
of differences of the $\zeta$’s. On the other hand, the joint distribution of the $X$’s is
proportional to

(6.9) $\exp\left[\sum x_i \log\frac{p_i}{q_i}\right] h(x),$

and the natural parameters of this exponential family are $\eta_i = \log(p_i/q_i)$. The
restrictions (6.8) are not linear in the $\eta$’s, and the minimal sufficient statistics for
the exponential family (6.9) with the restrictions (6.8) are not complete.

It is interesting to ask whether there exists a distribution $F$ for the underlying
variables $Z_i$ such that a linear structure for the $\zeta_i$ will result in a linear structure
for $\eta_i = \log(p_i/q_i)$ when the $p_i$ and the $\zeta_i$ are linked by the equation

(6.10) $q_i = P(Z_i \le u) = F(u - \zeta_i)$

instead of by (6.5). Then, $\zeta_i = u - F^{-1}(q_i)$ so that linear functions of the $\zeta_i$
correspond to linear functions of the $F^{-1}(q_i)$ and hence of $\log(p_i/q_i)$, provided

(6.11) $F^{-1}(q_i) = a - b \log\frac{p_i}{q_i}.$

Suppressing the subscript $i$ and putting $x = a - b \log(p/q)$, we see that (6.11) is
equivalent to

(6.12) $q = F(x) = \frac{1}{1 + e^{-(x-a)/b}},$

which is the cdf of the logistic distribution $L(a, b)$ whose density is shown in Table
2.3.1.
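The step from (6.11) to (6.12) can be spelled out as follows. This is a worked version of the substitution described above, assuming $b > 0$ so that $F$ is increasing, and using $p = 1 - q$:

```latex
\begin{aligned}
x = a - b\log\frac{p}{q}
  &\;\Longleftrightarrow\; \log\frac{1-q}{q} = -\frac{x-a}{b} \\
  &\;\Longleftrightarrow\; \frac{1}{q} = 1 + e^{-(x-a)/b} \\
  &\;\Longleftrightarrow\; q = \frac{1}{1 + e^{-(x-a)/b}},
\end{aligned}
\qquad\text{and } q = F(x) \text{ because } x = F^{-1}(q).
```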

Inferences based on the assumption of linearity in $\Phi^{-1}(p_i)$ and $\log(p_i/q_i) =
-F^{-1}(q_i)$ with $F$ given by (6.12) where, without loss of generality, we can take
$a = 0$, $b = 1$, are known as probit and logit analysis, respectively, and are widely
used analysis techniques. For more details and many examples, see Cox (1970),
Bishop, Fienberg, and Holland (1975), or Agresti (1990). As is shown by Cox (p. 28),
the two analyses may often be expected to give very similar results, provided the
$p$’s are not too close to 0 or 1. The probit model can also be viewed as a special
case of a threshold model, a model in which it is only observed whether a random
variable exceeds a threshold (Finney 1971). For the calculation of the MLEs in
this model see Problem 6.4.16.
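The similarity of probit and logit analyses for moderate $p$ can be checked numerically. The sketch below is an illustration, not part of the cited argument: it compares the two transforms $\Phi^{-1}(p)$ and $\log(p/q)$ on a grid of probabilities using only the Python standard library (`statistics.NormalDist.inv_cdf` for the normal quantile); the grid itself is an arbitrary choice.

```python
import math
from statistics import NormalDist

def probit(p):
    """Standard normal quantile, Phi^{-1}(p)."""
    return NormalDist().inv_cdf(p)

def logit(p):
    """Log odds, log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Ratio of the two transforms on a grid that stays away from 0, 1, and 0.5
# (at p = 0.5 both transforms are zero, so the ratio is undefined there).
grid = (0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
ratios = {p: logit(p) / probit(p) for p in grid}
```

On this grid the ratio varies only slightly, so a model linear in one transform is nearly linear in the other after rescaling, which is consistent with the remark attributed to Cox.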

The outcomes of $s$ independent binomial experiments can be represented by a
$2 \times s$ contingency table, as in Table 3.3.1, with $I = 2$ and $J = s$, and the outcomes
$A_1$ and $A_2$ corresponding to success and failure, respectively. The column totals
$n_{+1}, \ldots, n_{+s}$ are simply the $s$ sample sizes and are, therefore, fixed in the present
model. In fact, this is the principal difference between the present model and that
assumed for a $2 \times J$ table in Example 2.3.9. The case of $s$ independent binomials
arises in the situation of that example, if the $n$ subjects, instead of being drawn at
random from the population at large, are obtained by drawing $n_{+j}$ subjects from
the subpopulation having property $B_j$ for $j = 1, \ldots, s$.

A $2 \times J$ contingency table, with fixed column totals and with the distribution
of the cell counts given by independent binomials, occurs not only in its own
right through the sampling of $n_{+1}, \ldots, n_{+J}$ subjects from categories $B_1, \ldots, B_J$,
respectively, but also in the multinomial situation of Example 6.1 with $I = 2$, as the
conditional distribution of the cell counts given the column totals. This relationship
leads to an apparent paradox. In the conditional model, the UMVU estimator of
the probability $p_j = p_{1j}/(p_{1j} + p_{2j})$ of success, given that the subject is in $B_j$, is
$\delta_j = n_{1j}/n_{+j}$. Since $\delta_j$ satisfies

(6.13) $E(\delta_j \mid B_j) = p_j,$

it appears also to satisfy $E(\delta_j) = p_j$ and hence to be an unbiased estimator of
$p_{1j}/(p_{1j} + p_{2j})$ in the original multinomial model. On the other hand, an easy
extension of the argument of Example 3.3.1 (see Problem 2.3.25) shows that,
in this model, only polynomials in the $p_{ij}$ can be U-estimable, and the ratio in
question clearly is not a polynomial.

The explanation lies in the tacit assumption made in (6.13) that $n_{+j} > 0$ and
in the fact that $\delta_j$ is not defined when $n_{+j} = 0$. To ensure at least one observation
in $B_j$, one needs a sampling scheme under which an arbitrarily large number of
observations is possible. For such a scheme, the U-estimability of $p_{1j}/(p_{1j} + p_{2j})$
would no longer be surprising.

It is clear from the discussion leading to (6.8) that the generalization of normal
linear models to models linear in the natural parameters $\eta_i$ of an exponential family
is too special and that, instead, linear spaces in suitable functions of the $\eta_i$ should
be permitted. Because in exponential families the parameters of primary interest
often are the expectations $\theta_i = E(T_i)$ [for example in (6.9), the $p_i = E(X_i)$],
generalized linear models are typically defined by restricting the parameters to
lie in a space defined by linear conditions on $v(\theta_i)$ [or in some cases $v_i(\theta_i)$] for a
suitable link function $v$ (linking the $\theta$’s with the linear space). A theory of such
models was developed by Dempster (1971) and Nelder and Wedderburn (1972), who,
in particular, discuss maximum likelihood estimation of the parameters. Further
aspects are treated in Wedderburn (1976) and in Pregibon (1980). For a
comprehensive treatment of these generalized linear models, see the book by McCullagh
and Nelder (1989), an essential reference on this topic; an introductory treatment
is provided by Dobson (1990). A generalized linear interactive modeling (GLIM)
package has been developed by Baker and Nelder (1983a,b). The GLIM package
has proved invaluable in implementing these methods and has been at the center
of much of the research and modeling (see, for example, Aitken et al. 1989).

7 Finite Population Models 

In the location-scale models of Sections 3.1 and 3.3, and the more general linear
models of Sections 4.4 and 4.5, observations are measurements that are subject to
random errors. The parameters to be estimated are the true values of the quantities
being measured, or differences and other linear functions of these values, and the
variance of the measurement errors. We shall now consider a class of problems in
which the measurements are assumed to be without error, but in which the
observations are nevertheless random because the subjects (or objects) being observed
are drawn at random from a finite population.

Problems of this kind occur whenever one wishes to estimate the average
income, days of work lost to illness, reading level, or the proportion of a population
supporting some measure or candidate. The elements being sampled need not be
human but may be trees, food items, financial records, schools, and so on. We
shall consider here only the simplest sampling schemes. For a fuller account of
the principal methods of sampling, see, for example, Cochran (1977); a systematic
treatment of the more theoretical aspects is given by Cassel, Särndal, and Wretman
(1977) and Särndal, Swensson, and Wretman (1992).

The prototype of the problems to be considered is the estimation of a population 
average on the basis of a simple random sample from that population. In order 
to draw a random sample, one needs to be able to identify the members of the 
population. Telephone subscribers, for example, can conveniently be identified by 
the page and position on the page, trees by their coordinates, and students in a 
class by their names or by the row and number of their seat. In general, a list or 
other identifying description of the members of the population is called a frame. 
To represent the sampling frame, suppose that $N$ population elements are labeled
$1, \ldots, N$; in addition, a value $a_i$ (the quantity of interest) is associated with the
element $i$. (This notation is somewhat misleading because, in any realization of
the model, the $a$’s will simply be $N$ real numbers without identifying subscripts.)
For the purpose of estimating $\bar a = \sum_{i=1}^{N} a_i/N$, a sample of size $n$ is drawn in order,
one element after another, without replacement. It is a simple random sample if
all $N(N-1)\cdots(N-n+1)$ possible $n$-tuples are equally likely.

The data resulting from such a sampling process consist of the $n$ labels of the
sampled elements and the associated $a$ values, in the order in which they were
drawn, say

(7.1) $X = \{(I_1, Y_1), \ldots, (I_n, Y_n)\}$

where the $I$’s denote the labels and the $Y$’s the associated $a$ values, $Y_k = a_{I_k}$. The
unknown aspect of the situation, which as usual we shall denote by $\theta$, is the set of
population $a$ values of the $N$ elements,

(7.2) $\theta = \{(1, a_1), \ldots, (N, a_N)\}.$

In the classic approach to sampling, the labels are discarded. Let us for a moment
follow this approach, so that what remains of the data is the set of $n$ observed
$a$ values: $Y_1, \ldots, Y_n$. Under simple random sampling, the order statistics $Y_{(1)} \le
\cdots \le Y_{(n)}$ are then sufficient. To obtain UMVU estimators of $\bar a$ and other functions
of the $a$’s, one needs to know whether this sufficient statistic is complete. The
answer depends on the parameter space $\Omega$, which we have not yet specified.

It frequently seems reasonable to assume that the set $V$ of possible values is the
same for each of the $a$’s and does not depend on the values taken on by the other
$a$’s. (This would not be the case, for example, if the $a$’s were the grades obtained
by the students in a class which is being graded “on the curve.”) The parameter
space is then the set $\Omega$ of all $\theta$’s given by (7.2) with $(a_1, \ldots, a_N)$ in the Cartesian
product

(7.3) $V \times V \times \cdots \times V.$

Here, V may, for example, be the set of all real numbers, all positive real numbers, 
or all positive integers. Or it may just be the set V = {0, 1} representing a situation 
in which there are only two kinds of elements—those who vote yes or no, which 
are satisfactory or defective, and so on. 

Theorem 7.1 If the parameter space is given by (7.3), the order statistics $Y_{(1)}, \ldots,
Y_{(n)}$ are complete.

Proof. Denote by $s$ an unordered sample of $n$ elements and by $Y_{(1)}(s, \theta),
\ldots, Y_{(n)}(s, \theta)$ its $a$ values in increasing size. Then, the expected value of any
estimator $\delta$ depending only on the order statistics is

(7.4) $E_\theta\{\delta[Y_{(1)}, \ldots, Y_{(n)}]\} = \sum_s P(s)\,\delta[Y_{(1)}(s, \theta), \ldots, Y_{(n)}(s, \theta)],$

where the summation extends over all $\binom{N}{n}$ possible samples, and where for
simple random sampling, $P(s) = 1/\binom{N}{n}$ for all $s$. We need to show that

(7.5) $E_\theta\{\delta[Y_{(1)}, \ldots, Y_{(n)}]\} = 0$ for all $\theta \in \Omega$

implies that $\delta[y_{(1)}, \ldots, y_{(n)}] = 0$ for all $y_{(1)} \le \cdots \le y_{(n)}$.

Let us begin by considering (7.5) for all parameter points $\theta$ for which $(a_1, \ldots, a_N)$
is of the form $(a, \ldots, a)$, $a \in V$. Then, (7.5) reduces to

$\sum_s P(s)\,\delta(a, \ldots, a) = 0$ for all $a$,

which implies $\delta(a, \ldots, a) = 0$. Next, suppose that $N - 1$ elements in $\theta$ are equal
to $a$, and one is equal to $b > a$. Now, (7.5) will contain two kinds of terms:
those corresponding to samples consisting of $n$ $a$’s and those in which the sample
contains $b$, and (7.5) becomes

$p\,\delta(a, \ldots, a) + q\,\delta(a, \ldots, a, b) = 0$

where $p$ and $q$ are known numbers $\ne 0$. Since the first term has already been shown
to be zero, it follows that $\delta(a, \ldots, a, b) = 0$. Continuing inductively, we see that
$\delta(a, \ldots, a, b, \ldots, b) = 0$ for any $k$ $a$’s and $n - k$ $b$’s, $k = 0, \ldots, n$.
As the next stage in the induction argument, consider $\theta$’s of the form
$(a, \ldots, a, b, c)$ with $a < b < c$, then $\theta$’s of the form $(a, \ldots, a, b, b, c)$, and so
on, showing successively that $\delta(a, \ldots, a, b, c)$, $\delta(a, \ldots, a, b, b, c), \ldots$ are equal
to zero. Continuing in this way, we see that $\delta[y_{(1)}, \ldots, y_{(n)}] = 0$ for all possible
$(y_{(1)}, \ldots, y_{(n)})$, and this proves completeness. □

It is interesting to note the following: 

(a) No use has been made of the assumption of simple random sampling, so that
the result is valid also for other sampling methods for which the probabilities
$P(s)$ are known and positive for all $s$.

(b) The result need not be true for other parameter spaces $\Omega$ (Problem 7.1).

Corollary 7.2 On the basis of the sample values $Y_1, \ldots, Y_n$, a UMVU estimator
exists for any U-estimable function of the $a$’s, and it is the unique unbiased
estimator $\delta(Y_1, \ldots, Y_n)$ that is symmetric in its $n$ arguments.

Proof. The result follows from Theorem 2.1.11 and the fact that a function of
$y_1, \ldots, y_n$ depends only on $y_{(1)}, \ldots, y_{(n)}$ if and only if it is symmetric in its $n$
arguments (see Section 2.4). □

Example 7.3 UMVU estimation in simple random sampling. If the sampling
method is simple random sampling and the estimand is $\bar a$, the sample mean $\bar Y$ is
clearly unbiased since $E(Y_i) = \bar a$ for all $i$ (Problem 7.2). Since $\bar Y$ is symmetric in
$Y_1, \ldots, Y_n$, it is UMVU and among unbiased estimators, it minimizes the risk for
any convex loss function. The variance of $\bar Y$ is (Problem 7.3)

(7.6) $\mathrm{var}(\bar Y) = \frac{N - n}{N - 1} \cdot \frac{\tau^2}{n}$

where

(7.7) $\tau^2 = \frac{1}{N} \sum_{i=1}^{N} (a_i - \bar a)^2$

is the population variance. To obtain an unbiased estimator of $\tau^2$, note that
(Problem 7.3)

(7.8) $E\left[\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar Y)^2\right] = \frac{N}{N - 1}\, \tau^2.$

Thus, $[(N - 1)/N(n - 1)] \sum_{i=1}^{n} (Y_i - \bar Y)^2$ is unbiased for $\tau^2$, and because it is
symmetric in its $n$ arguments, it is UMVU. ||
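For a small population, the claims of Example 7.3 can be verified by brute force: enumerate every sample, and average. The sketch below is an illustrative check, with an arbitrary function name and an arbitrary test population; it uses exact rational arithmetic so that equalities hold exactly rather than approximately. Unordered samples are enumerated, which is sufficient because all the statistics involved are symmetric in the observations.

```python
from fractions import Fraction
from itertools import combinations

def srs_check(a, n):
    """Check unbiasedness and the variance formula under simple random sampling.

    a is the list of population values (length N), n >= 2 the sample size.
    """
    N = len(a)
    a = [Fraction(v) for v in a]
    a_bar = sum(a) / N
    tau2 = sum((v - a_bar) ** 2 for v in a) / N          # population variance

    samples = list(combinations(a, n))                   # all equally likely
    means = [sum(s) / n for s in samples]
    mean_of_means = sum(means) / len(samples)
    var_of_means = sum((m - a_bar) ** 2 for m in means) / len(samples)
    var_estimates = [Fraction(N - 1, N * (n - 1)) * sum((y - m) ** 2 for y in s)
                     for s, m in zip(samples, means)]

    return {
        "sample mean unbiased": mean_of_means == a_bar,
        "variance of sample mean": var_of_means == Fraction(N - n, N - 1) * tau2 / n,
        "variance estimator unbiased": sum(var_estimates) / len(var_estimates) == tau2,
    }
```

Repeated population values are handled correctly because `combinations` distinguishes elements by position, not by value.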


If the sampling method is sequential, the stopping rule may add an additional 
complication. 

Example 7.4 Sum-quota sampling. Suppose that each $Y_i$ has associated with it a
cost $C_i$, a positive random variable, and sampling is continued until $\nu$ observations
are taken, where $\sum_{i=1}^{\nu-1} C_i < Q < \sum_{i=1}^{\nu} C_i$, with $Q$ being a specified quota. (Note the
similarity to inverse binomial sampling, as discussed in Example 2.3.2.) Under this
sampling scheme, Pathak (1976) showed that $\bar Y_{\nu-1} = \frac{1}{\nu-1}\sum_{i=1}^{\nu-1} Y_i$ is an unbiased
estimator of the population average $\bar a$ (Problem 7.4).

Note that Pathak’s estimator drops the terminal observation $Y_\nu$, which tends to
be upwardly biased. As a consequence, Pathak’s estimator can be improved upon.
This was done by Kremers (1986), who showed the following:

(a) $T = \{(C_1, Y_1), \ldots, (C_\nu, Y_\nu)\}$ is complete sufficient.

(b) Conditional on $T$, $\{(C_1, Y_1), \ldots, (C_{\nu-1}, Y_{\nu-1})\}$ are exchangeable (Problem
7.5).

Under these conditions, the estimator

(7.9) $\hat a = \bar Y - (\bar Y_{[\nu]} - \bar Y)/(\nu - 1)$

is UMVU if $\nu > 1$, where $\bar Y_{[\nu]}$ is the mean of all of the observations that could
have been the terminal observation; that is, $\bar Y_{[\nu]}$ is the mean of all the $Y_j$’s in the
set

(7.10) $\{(c_j, y_j) : \sum_{i \ne j} c_i < Q,\ j = 1, \ldots, \nu\}.$

See Problem 7.6. ||
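A minimal sketch of the two estimators in Example 7.4, computed from one completed sum-quota sample. The input format (a list of `(cost, y)` pairs in the order drawn, ending with the terminal draw) and the function name are assumptions made for illustration. An observation "could have been the terminal observation" when the remaining costs add up to less than the quota, which is how the candidate set is formed below.

```python
def sum_quota_estimates(pairs, quota):
    """Return (Pathak's estimate, Kremers' estimate) for a sum-quota sample.

    pairs: list of (cost, y) in the order drawn; the last pair is the draw
    at which the cumulative cost first exceeded the quota.
    """
    nu = len(pairs)
    costs = [c for c, _ in pairs]
    ys = [y for _, y in pairs]
    total_cost = sum(costs)
    if nu < 2 or not (total_cost - costs[-1] < quota < total_cost):
        raise ValueError("not a completed sum-quota sample with nu > 1")

    pathak = sum(ys[:-1]) / (nu - 1)          # mean without the terminal draw
    y_bar = sum(ys) / nu                      # mean of all nu draws
    # Draws whose removal leaves a total cost below the quota.
    candidates = [y for c, y in pairs if total_cost - c < quota]
    y_terminal = sum(candidates) / len(candidates)
    kremers = y_bar - (y_terminal - y_bar) / (nu - 1)
    return pathak, kremers
```

The terminal draw always belongs to the candidate set, so the set is never empty; when it is the only candidate, the two estimates coincide.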


So far, we have ignored the labels. That Theorem 7.1 and Corollary 7.2 no longer 
hold when the labels are included in the data is seen by the following result. 


Theorem 7.5 Given any sampling scheme of fixed size $n$ which assigns to the
sample $s$ a known probability $P(s)$ (which may depend on the labels but not on the
$a$ values of the sample), given any U-estimable function $g(\theta)$, and given any
preassigned parameter point $\theta_0 = \{(1, a_{10}), \ldots, (N, a_{N0})\}$, there exists an unbiased
estimator $\delta^*$ of $g(\theta)$ with variance $\mathrm{var}_{\theta_0}(\delta^*) = 0$.


Proof. Let $\delta$ be any unbiased estimator of $g(\theta)$, which may depend on both labels
and $y$ values, say

$\delta(s) = \delta[(i_1, y_1), \ldots, (i_n, y_n)],$

and let

$\delta_0(s) = \delta[(i_1, a_{i_1 0}), \ldots, (i_n, a_{i_n 0})].$

Note that $\delta_0$ depends on the labels whether or not $\delta$ does and thus would not be
available if the labels had been discarded. Let

$\delta^*(s) = \delta(s) - \delta_0(s) + g(\theta_0).$

Since

$E_\theta(\delta) = g(\theta)$ and $E_\theta(\delta_0) = g(\theta_0),$

it is seen that $\delta^*$ is unbiased for estimating $g(\theta)$. When $\theta = \theta_0$, $\delta^* = g(\theta_0)$ and is
thus a constant. Its variance is therefore zero, as was to be proved. □
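The construction in the proof can be checked by enumeration on a toy population. The sketch below is an illustration under stated assumptions: $g(\theta)$ is taken to be the population mean, $\delta$ the sample mean, sampling is simple random sampling of $n$ labels, and the population values are integers; the function name is hypothetical. It returns the exact mean and variance of $\delta^* = \delta - \delta_0 + g(\theta_0)$ over all samples.

```python
from fractions import Fraction
from itertools import combinations

def delta_star_moments(a, a0, n):
    """Exact mean and variance of delta* over all samples of n labels.

    a  : integer population values under the true theta
    a0 : integer population values under the preassigned theta_0
    """
    N = len(a)
    g0 = Fraction(sum(a0), N)                                # g(theta_0)
    values = []
    for labels in combinations(range(N), n):                 # sampled labels
        delta = Fraction(sum(a[i] for i in labels), n)       # uses observed y values
        delta0 = Fraction(sum(a0[i] for i in labels), n)     # uses labels only
        values.append(delta - delta0 + g0)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance
```

Calling the function with `a` equal to `a0` gives variance zero, while any other `a` still yields the correct mean; the price is that $\delta^*$ depends on the labels and on the guess $\theta_0$.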

To see under what circumstances the labels are likely to be helpful and when it 
is reasonable to discard them, let us consider an example. 

Example 7.6 Informative labels. Suppose the population is a class of several 
hundred students. A random sample is drawn and each of the sampled students is 
asked to provide a numerical evaluation of the instructor. (Such a procedure may be 
more accurate than distributing reaction sheets to the whole class, if for the much 
smaller sample it is possible to obtain a considerably higher rate of response.) 
Suppose that the frame is an alphabetically arranged class list and that the label is 
the number of the student on this list. Typically, one would not expect this label 
to carry any useful information since the place of a name in the alphabet does not
usually shed much light on the student’s attitude toward the instructor. (Of course,
there may be exceptional circumstances that vitiate this argument.) On the other 
hand, suppose the students are seated alphabetically. In a large class, the students 
sitting in front may have the advantage of hearing and seeing better, receiving more 
attention from the instructor, and being less likely to read the campus newspaper 
or fall asleep. Their attitude could thus be affected by the place of their name in 
the alphabet, and thus the labels could carry some information. 

We shall discuss two ways of formalizing the idea that the labels can reasonably 
be discarded if they appear to be unrelated to the associated a values. 

(i) Invariance. Consider the transformations of the parameter and sample space
obtained by an arbitrary permutation of the labels:

(7.11)  $\bar g\theta = \{(j(1), a_1), \ldots, (j(N), a_N)\}$,
        $gX = \{(j(I_1), Y_1), \ldots, (j(I_n), Y_n)\}$.

The estimand $\bar a$ [or, more generally, any function $h(a_1, \ldots, a_N)$ that is symmetric
in the a’s] is unchanged by these transformations, so that $g^*d = d$ and a loss
function $L(\theta, d)$ is invariant if it depends on $\theta$ only through the a’s (in fact, as a
symmetric function of the a’s) and not the labels. [For estimating $\bar a$, such a loss
function would be typically of the form $\rho(d - \bar a)$.] Since $g^*d = d$, an estimator $\delta$
is equivariant if it satisfies the condition

(7.12)  $\delta(gX) = \delta(X)$ for all $g$ and $X$.

In this case, equivariance thus reduces to invariance. Condition (7.12) holds if and
only if the estimator $\delta$ depends only on the observed Y values and not on the labels.
Combining this result with Corollary 7.2, we see that for any U-estimable function
$h(a_1, \ldots, a_N)$, the estimator of Corollary 7.2 uniformly minimizes the risk for any
convex loss function that does not depend on the labels among all estimators of h
which are both unbiased and invariant.

The appropriateness of the principle of equivariance, which permits restricting 
consideration to equivariant (in the present case, invariant) estimators, depends on 
the assumption that the transformations (7.11) leave the problem invariant. This 
is clearly not the case when there is a relationship between the labels and the 
associated a values, for example, when low a values tend to be associated with 
low labels and high a values with high labels, since permutation of the labels will 
destroy this relationship. Equivariance considerations therefore justify discarding 
the labels if, in our judgment, the problem is symmetric in the labels, that is, 
unchanged under any permutation of the labels. 

(ii) Random labels. Sometimes, it is possible to adopt a slightly different formulation
of the model which makes an appeal to equivariance unnecessary. Suppose
that the labels have been assigned at random, that is, so that all $N!$ possible
assignments are equally likely. Then, the observed a values $Y_1, \ldots, Y_n$ are sufficient.
To see this, note that given these values, any n labels $(I_1, \ldots, I_n)$ associated with
them are equally likely, so that the conditional distribution of X given $(Y_1, \ldots, Y_n)$
is independent of $\theta$. In this model, the estimators of Corollary 7.2 are, therefore,
UMVU without any further restriction.

Of course, the assumption of random labeling is legitimate only if the labels
really were assigned at random rather than in some systematic way such as
alphabetically or first come, first labeled. In the latter cases, rather than incorporating a
very shaky assumption into the model, it seems preferable to invoke equivariance
when it comes to the analysis of the data with the implied admission that we
believe the labels to be unrelated to the a values but without denying that a hidden
relationship may exist.

Simple random sampling tends to be inefficient unless the population being
sampled is fairly homogeneous with respect to the a’s. To see this, suppose that
$a_1 = \cdots = a_{N_1} = a$ and $a_{N_1+1} = \cdots = a_{N_1+N_2} = b$ ($N_1 + N_2 = N$). Then (Problem
7.3)

(7.13)  $\operatorname{var}(\bar Y) = \dfrac{N - n}{N - 1} \cdot \dfrac{\gamma(1 - \gamma)}{n} (b - a)^2$

where $\gamma = N_1/N$. On the other hand, suppose that the subpopulations $\Pi_i$ consisting
of the a’s and b’s, respectively, can be identified and that one observation $X_i$ is taken
from each of the $\Pi_i$ ($i = 1, 2$). Then $X_1 = a$ and $X_2 = b$ and $(N_1 X_1 + N_2 X_2)/N = \bar a$
is an unbiased estimator of $\bar a$ with variance zero.
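
Formula (7.13) can be verified by brute-force enumeration. The sketch below assumes simple random sampling without replacement and uses hypothetical values for $N_1$, $N_2$, a, b, and n; the helper names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

def srs_variance_of_mean(values, n):
    """Exact variance of the sample mean over all equally likely samples of size n."""
    means = [Fraction(sum(s), n) for s in combinations(values, n)]
    center = sum(means) / len(means)
    return sum((m - center) ** 2 for m in means) / len(means)

def formula_7_13(N1, N2, a, b, n):
    """Right side of (7.13) with gamma = N1 / N."""
    N = N1 + N2
    gamma = Fraction(N1, N)
    return Fraction(N - n, N - 1) * gamma * (1 - gamma) * (b - a) ** 2 / n

N1, N2, a, b, n = 3, 5, 10, 20, 4            # hypothetical two-value population
population = [a] * N1 + [b] * N2
exact = srs_variance_of_mean(population, n)
predicted = formula_7_13(N1, N2, a, b, n)

# One observation per subpopulation recovers the population mean exactly.
stratified = Fraction(N1 * a + N2 * b, N1 + N2)
pop_mean = Fraction(sum(population), len(population))
print(exact, predicted, stratified == pop_mean)
```

The enumerated variance agrees with (7.13), while the two-observation stratified estimate has no sampling variability at all.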

This suggests that rather than taking a simple random sample from a heterogeneous
population $\Pi$, one should try to divide $\Pi$ into more homogeneous
subpopulations $\Pi_i$, called strata, and sample each of the strata separately. Human
populations are frequently stratified by such factors as age, gender, socioeconomic
background, severity of disease, or by administrative units such as schools,
hospitals, counties, voting districts, and so on.

Suppose that the population $\Pi$ has been partitioned into s strata $\Pi_1, \ldots, \Pi_s$ of
sizes $N_1, \ldots, N_s$ and that independent simple random samples of size $n_i$ are taken
from each $\Pi_i$ ($i = 1, \ldots, s$). If $a_{ij}$ ($j = 1, \ldots, N_i$) denote the a values in the ith
stratum, the parameter is now $\theta = (\theta_1, \ldots, \theta_s)$, where

$\theta_i = \{(1, a_{i1}), \ldots, (N_i, a_{iN_i}); i\}$,

and the observations are $X = (X_1, \ldots, X_s)$, where

$X_i = \{(K_{i1}, Y_{i1}), \ldots, (K_{in_i}, Y_{in_i}); i\}$.

Here, $K_{ij}$ is the label of the jth element drawn from $\Pi_i$ and $Y_{ij}$ is its a value.

It is now easy to generalize the optimality results for simple random sampling 
to stratified sampling. 

Theorem 7.7 Let the $Y_{ij}$ ($j = 1, \ldots, n_i$), ordered separately for each i, be denoted
by $Y_{i(1)} < \cdots < Y_{i(n_i)}$. On the basis of the $Y_{ij}$ (i.e., without the labels), these ordered
sample values are sufficient. They are also complete if the parameter space $\Omega_i$ for
$\theta_i$ is of the form $V_i \times \cdots \times V_i$ ($N_i$ factors) and the overall parameter space is
$\Omega = \Omega_1 \times \cdots \times \Omega_s$. (Note that the value sets $V_i$ may be different for different
strata.)

The proof is left to the reader (Problem 7.9). 

It follows from Theorem 7.7 that on the basis of the Y’s, a UMVU estimator exists
for any U-estimable function of the a’s and that it is the unique unbiased estimator
$\delta(Y_{11}, \ldots, Y_{1n_1}; Y_{21}, \ldots, Y_{2n_2}; \ldots)$ which is symmetric in its first $n_1$ arguments,
symmetric in its second set of $n_2$ arguments, and so forth.

Example 7.8 UMVU estimation in stratified random sampling. Suppose that
we let $a_{\cdot\cdot} = \sum\sum a_{ij}/N$ be the average of the a’s for the population $\Pi$. If $a_{i\cdot}$ is the
average of the a’s in $\Pi_i$, $\bar Y_{i\cdot}$ is unbiased for estimating $a_{i\cdot}$ and hence

(7.14)  $\delta = \sum \dfrac{N_i \bar Y_{i\cdot}}{N}$

is an unbiased estimator of $a_{\cdot\cdot}$. Since $\delta$ is symmetric for each of the s subsamples,
it is UMVU for $a_{\cdot\cdot}$ on the basis of the Y’s. From (7.6) and the independence of the
$\bar Y_{i\cdot}$’s, it is seen that

(7.15)  $\operatorname{var}(\delta) = \sum \left(\dfrac{N_i}{N}\right)^2 \cdot \dfrac{N_i - n_i}{N_i - 1} \cdot \dfrac{1}{n_i} \tau_i^2$

where $\tau_i^2$ is the population variance of $\Pi_i$, and from (7.8), one can read off the
UMVU estimator of (7.15). ||
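
As an illustration, the estimator (7.14) and the variance (7.15) can be checked by enumerating every stratified sample from a tiny population. This is a sketch under the assumption of independent simple random samples without replacement within each stratum; the strata values, sample sizes, and function names are hypothetical, and each stratum must contain at least two elements for the factor $N_i - 1$ to be nonzero.

```python
from fractions import Fraction
from itertools import combinations, product

def pop_variance(values):
    """Population variance tau^2 = sum (a - mean)^2 / N of one stratum."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)

def stratified_estimate(samples, sizes):
    """(7.14): sum over strata of N_i * (stratum sample mean) / N."""
    N = sum(sizes)
    return sum(Fraction(Ni * sum(s), len(s)) for s, Ni in zip(samples, sizes)) / N

def variance_7_15(strata, ns):
    """(7.15): sum of (N_i/N)^2 * (N_i - n_i)/(N_i - 1) * tau_i^2 / n_i."""
    N = sum(len(st) for st in strata)
    total = Fraction(0)
    for st, ni in zip(strata, ns):
        Ni = len(st)
        total += Fraction(Ni, N) ** 2 * Fraction(Ni - ni, Ni - 1) * pop_variance(st) / ni
    return total

strata = [[1, 2, 4], [10, 13, 15, 18]]   # hypothetical a values in two strata
ns = [2, 2]                              # hypothetical sample sizes n_i
sizes = [len(st) for st in strata]

# All stratified samples are equally likely, so plain averages give exact moments.
all_samples = product(*(combinations(st, ni) for st, ni in zip(strata, ns)))
estimates = [stratified_estimate(s, sizes) for s in all_samples]
mean_est = sum(estimates) / len(estimates)
var_est = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
print(mean_est, var_est, variance_7_15(strata, ns))
```

The enumerated mean equals the population average and the enumerated variance matches (7.15).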


Discarding the labels within each stratum (but not the strata labels) can again 
be justified by invariance considerations if these labels appear to be unrelated to 
the associated a values. Permutation of the labels within each stratum then leaves 
the problem invariant, and the condition of equivariance reduces to the invariance 
condition (7.12). In the present situation, an estimator again satisfies (7.12) if and 
only if it does not depend on the within-strata labels. The estimator (7.14), and 
other estimators which are UMVU when these labels are discarded, are therefore 
also UMVU invariant without this restriction. 

A central problem in stratified sampling is the choice of the sample sizes $n_i$.
This is a design question and hence outside the scope of this book (but see Hedayat
and Sinha 1991). We only mention that a natural choice is proportional allocation,
in which the sample sizes $n_i$ are proportional to the population sizes $N_i$. If the $\tau_i$
are known, the best possible choice in the sense of minimizing the approximate
variance

(7.16)  $\sum N_i^2 \tau_i^2 / (n_i N^2)$

is the Tschuprow-Neyman allocation with $n_i$ proportional to $N_i \tau_i$ (Problem 7.11).
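
The two allocations can be compared directly on the approximate variance (7.16). The sketch below treats the $n_i$ as real numbers (ignoring the rounding needed in practice) and uses hypothetical stratum sizes and standard deviations; the function names are illustrative.

```python
def approx_variance(Ns, taus, ns):
    """(7.16): sum over strata of N_i^2 * tau_i^2 / (n_i * N^2)."""
    N = sum(Ns)
    return sum(Ni**2 * ti**2 / (ni * N**2) for Ni, ti, ni in zip(Ns, taus, ns))

def proportional_allocation(Ns, n):
    """n_i proportional to the stratum sizes N_i."""
    N = sum(Ns)
    return [n * Ni / N for Ni in Ns]

def neyman_allocation(Ns, taus, n):
    """n_i proportional to N_i * tau_i (Tschuprow-Neyman)."""
    weights = [Ni * ti for Ni, ti in zip(Ns, taus)]
    total = sum(weights)
    return [n * w / total for w in weights]

Ns = [500, 300, 200]        # hypothetical stratum sizes
taus = [2.0, 5.0, 12.0]     # hypothetical stratum standard deviations
n = 100                     # total sample size

v_prop = approx_variance(Ns, taus, proportional_allocation(Ns, n))
v_ney = approx_variance(Ns, taus, neyman_allocation(Ns, taus, n))
print(v_prop, v_ney)
```

With these numbers the Tschuprow-Neyman allocation puts almost half the sample into the smallest but most variable stratum and gives a visibly smaller value of (7.16) than proportional allocation.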

Stratified sampling, in addition to providing greater precision for the same total
sample size than simple random sampling, often has the advantage of being
administratively more convenient, which may mean that a larger sample size is possible
on the same budget. Administrative convenience is the principal advantage of a
third sampling method, cluster sampling, which we shall consider next. The
population is divided into K clusters of sizes $M_1, \ldots, M_K$. A single random sample
of k clusters is taken and the a values of all the elements in the sampled clusters
are obtained. The clusters might, for example, be families or city blocks. A field
worker obtaining information about one member of a family can often obtain the
same information for all the members at relatively little additional cost.

An important special case of cluster sampling is systematic sampling. Suppose
the items on a conveyor belt or the cards in a card catalog are being sampled. The
easiest way of drawing a sample in these cases and in many situations in which the
sampling is being done in the field is to take every rth element, where r is some
positive number. To inject some randomness into the process, the starting point is
chosen at random. Here, there are r clusters consisting of the items labeled

$\{1, r + 1, 2r + 1, \ldots\}, \{2, r + 2, 2r + 2, \ldots\}, \ldots, \{r, 2r, 3r, \ldots\}$,

of which one is chosen at random, so that K = r and k = 1. In general, let the
elements of the ith cluster be $\{a_{i1}, \ldots, a_{iM_i}\}$ and let $u_i = \sum_j a_{ij}$ be the total for
the ith cluster. We shall be interested in estimating some function of the u’s such as
the population average $a_{\cdot\cdot} = \sum u_i / \sum M_i$. Of the $a_{ij}$, we shall assume that the vector
of values $(a_{i1}, \ldots, a_{iM_i})$ belongs to some set $W_i$ (which may, but need not be, of the
form $V \times \cdots \times V$) and that $(a_{11}, \ldots, a_{1M_1}; a_{21}, \ldots, a_{2M_2}; \ldots) \in W_1 \times \cdots \times W_K$.
The observations consist of the labels of the clusters included in the sample together
with the full set of labels and values of the elements of each such cluster:

$X = \{[i_1; (1, a_{i_1,1}), (2, a_{i_1,2}), \ldots]; [i_2; (1, a_{i_2,1}), (2, a_{i_2,2}), \ldots]; \ldots\}$.
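
The view of systematic sampling as cluster sampling with K = r and k = 1 can be made concrete with a short sketch. The function names and the choice N = 12, r = 4 are hypothetical, and the population is assumed to carry the labels 1, ..., N.

```python
import random

def systematic_clusters(N, r):
    """Partition the labels 1..N into the r clusters {j, j + r, j + 2r, ...}."""
    return [list(range(start, N + 1, r)) for start in range(1, r + 1)]

def systematic_sample(N, r, rng=random):
    """Pick the starting point at random, then take every r-th element."""
    start = rng.randint(1, r)
    return list(range(start, N + 1, r))

clusters = systematic_clusters(N=12, r=4)
print(clusters)   # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]

sample = systematic_sample(N=12, r=4, rng=random.Random(0))
print(sample in clusters)   # every systematic sample is exactly one cluster
```

When r does not divide N, the clusters differ in size by at most one element, which is the sense in which equal cluster sizes hold only approximately.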

Let us begin the reduction of the statistical problem with invariance considerations.
Clearly, the problem remains invariant under permutations of the labels
within each cluster, and this reduces the observation to

$X' = \{[i_1, (a_{i_1,1}, \ldots, a_{i_1,M_{i_1}})]; [i_2, (a_{i_2,1}, \ldots, a_{i_2,M_{i_2}})]; \ldots\}$

in the sense that an estimator is invariant under these permutations if and only if it
depends on X only through X'.

The next group is different from any we have encountered so far. Consider any
transformation taking $(a_{i1}, \ldots, a_{iM_i})$ into $(a'_{i1}, \ldots, a'_{iM_i})$, $i = 1, \ldots, K$, where the
$a'_{ij}$ are arbitrary, except that they must satisfy

(a)  $(a'_{i1}, \ldots, a'_{iM_i}) \in W_i$

and

(b)  $\sum_{j=1}^{M_i} a'_{ij} = u_i$.

Note that for some vectors $(a_{i1}, \ldots, a_{iM_i})$, there may be no such transformations
except the identity; for others, there may be just the identity and one other, and so
on, depending on the nature of $W_i$.

It is clear that these transformations leave the problem invariant, provided both
the estimand and the loss function depend on the a’s only through the u’s. Since the
estimand remains unchanged, the same should then be true for $\delta$, which, therefore,
should satisfy

(7.17)  $\delta(gX') = \delta(X')$

for all these transformations. It is easy to see (Problem 7.17) that $\delta$ satisfies (7.17)
if and only if $\delta$ depends on X' only through the observed cluster labels, cluster
sizes, and the associated cluster totals, that is, only on

(7.18)  $X'' = \{(i_1, u_{i_1}, M_{i_1}), \ldots, (i_k, u_{i_k}, M_{i_k})\}$

and the order in which the clusters were drawn.





This differs from the set of observations we would obtain in a simple random
sample from the collection

(7.19)  $\{(1, u_1), \ldots, (K, u_K)\}$

through the additional observations provided by the cluster sizes. For the estimation
of the population average or total, this information may be highly relevant and the
choice of estimator must depend on the relationship between $M_i$ and $u_i$. The
situation does, however, reduce to that of simple random sampling from (7.19)
under the additional assumption that the cluster sizes $M_i$ are equal, say $M_i = M$,
where M can be assumed to be known. This is the case, either exactly or as a very
close approximation, for systematic sampling, and also in certain applications to
industrial, commercial, or agricultural sampling, for example, when the clusters
are cartons of eggs or other packages or boxes containing a fixed number of items.
From the discussion of simple random sampling, we know that the average $\bar Y$ of
the observed u values is then the UMVU invariant estimator of $\bar u = \sum u_i / K$ and hence
that $\bar Y / M$ is UMVU invariant for estimating $a_{\cdot\cdot}$. The variance of the estimator is
easily obtained from (7.6) with $\tau^2 = \sum (u_i - \bar u)^2 / K$.
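
For equal cluster sizes, the estimator $\bar Y / M$ and its variance can again be checked by enumeration. The sketch assumes a simple random sample of k clusters without replacement and takes the variance from the finite-population formula $\frac{K - k}{K - 1} \cdot \frac{\tau^2}{k}$ for the mean of the cluster totals, divided by $M^2$; the cluster values and names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def cluster_estimate(sampled_clusters, M):
    """Average of the observed cluster totals, divided by the common size M."""
    totals = [sum(c) for c in sampled_clusters]
    return Fraction(sum(totals), len(totals) * M)

clusters = [[1, 2, 3], [4, 4, 4], [0, 9, 3], [5, 1, 6]]   # hypothetical, K = 4
K, M, k = len(clusters), 3, 2

u = [sum(c) for c in clusters]                  # cluster totals u_i
u_bar = Fraction(sum(u), K)
tau2 = sum((ui - u_bar) ** 2 for ui in u) / K   # tau^2 = sum (u_i - u_bar)^2 / K

# Exact moments over all equally likely samples of k clusters.
estimates = [cluster_estimate(s, M) for s in combinations(clusters, k)]
mean_est = sum(estimates) / len(estimates)
var_est = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)

pop_mean = Fraction(sum(u), K * M)
var_formula = Fraction(K - k, K - 1) * tau2 / k / M**2
print(mean_est, pop_mean, var_est, var_formula)
```

Three of the four clusters here have the same total even though their elements differ, which illustrates that only the totals $u_i$ matter for the estimator.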

In stratified sampling, it is desirable to have the strata as homogeneous as
possible: The more homogeneous a stratum, the smaller the sample size it requires. The
situation is just the reverse in cluster sampling, where the whole cluster will be 
observed in any case. The more homogeneous a cluster, the less benefit is derived 
from these observations: “If you have seen one, you have seen them all.” Thus, it 
is desirable to have the clusters as heterogeneous as possible. For example,
families, for some purposes, constitute good clusters by being both administratively
convenient and heterogeneous with respect to age and variables related to age. 
The advantages of stratified sampling apply not only to the sampling of single
elements but equally to the sampling of clusters. Stratified cluster sampling consists
of drawing a simple random sample of clusters from each stratum and combining 
the estimates of the strata averages or totals in the obvious way. The resulting 
estimator is again UMVU invariant, provided the cluster sizes are constant within 
each stratum, although they may differ from one stratum to the next. (For a more 
detailed discussion of stratified cluster sampling, see, for example, Kish 1965.) 

To conclude this section, we shall briefly indicate two ways in which the
equivariance considerations in the present section differ from those in the rest of the
chapter.

(i) In all of the present applications, the transformations leave the estimand
unchanged rather than transforming it into a different value, and the condition of
equivariance then reduces to the invariance condition: $\delta(gX) = \delta(X)$.
Correspondingly, the group G is not transitive over the parameter space and a UMRE estimator
cannot be expected to exist. To obtain an optimal estimator, one has to invoke
unbiasedness in addition to invariance. (For an alternative optimality property, see
Section 5.4.)

(ii) Instead of starting with transformations of the sample space which would
then induce transformations of the parameter space, we inverted the order and
began by transforming $\theta$, thereby inducing transformations of X. This does not
involve a new approach but was simply more convenient than the usual order. To
see how to present the transformations in the usual order, let us consider the sample
space as the totality of possible samples s together with the labels and values of
their elements. Suppose, for example, that the transformations are permutations
of the labels. Since the same elements appear in many different samples, one
must ensure that the transformations g of the samples are consistent, that is, that
the transform of an element is independent of the particular sample in which it
appears. If a transformation has this property, it will define a permutation of all
the labels in the population and hence a transformation $\bar g$ of $\theta$. Starting with $g$ or
$\bar g$ thus leads to the same result; the latter is more convenient because it provides
the required consistency property automatically.


8 Problems 


Section 1 


1.1 Prove the parts of Theorem 1.4 relating to (a) risk and (b) variance. 

1.2 In model (1.9), suppose that n = 2 and that f satisfies $f(-x_1, -x_2) = f(x_2, x_1)$.
Show that the distribution of $(X_1 + X_2)/2$ given $X_2 - X_1 = y$ is symmetric about 0.
Note that if $X_1$ and $X_2$ are iid according to a distribution which is symmetric about 0,
the above equation holds.

1.3 If $X_1$ and $X_2$ are distributed according to (1.9) with n = 2 and f satisfying the
assumptions of Problem 1.2, and if $\rho$ is convex and even, then the MRE estimator of $\xi$
is $(X_1 + X_2)/2$.

1.4 Under the assumptions of Example 1.18, show that (a) $E[X_{(1)}] = b/n$ and (b)
$\operatorname{med}[X_{(1)}] = b \log 2 / n$.

1.5 For each of the three loss functions of Example 1.18, compare the risk of the MRE 
estimator to that of the UMVU estimator. 

1.6 If T is a sufficient statistic for the family (1.9), show that the estimator (1.28) is a 
function of T only. [Hint: Use the factorization theorem.] 

1.7 Let $X_i$ ($i = 1, 2, 3$) be independently distributed with density $f(x_i - \xi)$ and let $\delta = X_1$
if $X_3 > 0$ and $= X_2$ if $X_3 < 0$. Show that the estimator $\delta$ of $\xi$ has constant risk for any
invariant loss function, but $\delta$ is not location equivariant.

1.8 Prove Corollary 1.14. [Hint: Show that (a) $\phi(v) = E_0 \rho(X - v) \to M$ as $v \to \pm\infty$
and (b) that $\phi$ is continuous; (b) follows from the fact (see TSH2, Appendix Section 2)
that if $f_n$, $n = 1, 2, \ldots$ and $f$ are probability densities such that $f_n(x) \to f(x)$ a.e., then
$\int \psi f_n \to \int \psi f$ for any bounded $\psi$.]

1.9 Let $X_1, \ldots, X_n$ be distributed as in Example 1.19 and let the loss function be that of
Example 1.15. Determine the totality of MRE estimators and show that the midrange is
one of them.


1.10 Consider the loss function

$\rho(t) = \begin{cases} -At & \text{if } t < 0 \\ Bt & \text{if } t \ge 0 \end{cases}$  (A, B > 0).

If X is a random variable with density f and distribution function F, show that $E\rho(X - v)$
is minimized for any v satisfying $F(v) = B/(A + B)$.

1.11 In Example 1.16, find the MRE estimator of $\xi$ when the loss function is given by
Problem 1.10.

1.12 Show that an estimator $\delta(X)$ of $g(\theta)$ is risk-unbiased with respect to the loss function
of Problem 1.10 if $F_\theta[g(\theta)] = B/(A + B)$, where $F_\theta$ is the cdf of $\delta(X)$ under $\theta$.

1.13 Suppose $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ have joint density
$f(x_1 - \xi, \ldots, x_m - \xi; y_1 - \eta, \ldots, y_n - \eta)$ and consider the problem of
estimating $\Delta = \eta - \xi$. Explain why it is desirable for the loss function
$L(\xi, \eta; d)$ to be of the form $\rho(d - \Delta)$ and for an estimator
$\delta$ of $\Delta$ to satisfy $\delta(x + a, y + b) = \delta(x, y) + (b - a)$.

1.14 Under the assumptions of the preceding problem, prove the equivalents of Theorems 
1.4-1.17 and Corollaries 1.11-1.14 for estimators satisfying the restriction. 

1.15 In Problem 1.13, determine the totality of estimators satisfying the restriction when 
m = n = 1. 

1.16 In Problem 1.13, suppose the X’s and Y’s are independently normally distributed
with known variances $\sigma^2$ and $\tau^2$. Find conditions on $\rho$ under which the MRE estimator
is $\bar Y - \bar X$.

1.17 In Problem 1.13, suppose the X’s and Y’s are independently distributed as $E(\xi, 1)$
and $E(\eta, \tau)$, respectively, and that m = n. Find conditions on $\rho$ under which the MRE
estimator of $\Delta$ is $Y_{(1)} - X_{(1)}$.

1.18 In Problem 1.13, suppose that X and Y are independent and that the loss function
is squared error. If $\hat\xi$ and $\hat\eta$ are the MRE estimators of $\xi$ and $\eta$, respectively, the MRE
estimator of $\Delta$ is $\hat\eta - \hat\xi$.

1.19 Suppose the X’s and Y’s are distributed as in Problem 1.17 but with $m \ne n$.
Determine the MRE estimator of $\Delta$ when the loss is squared error.

1.20 For any density f of $X = (X_1, \ldots, X_n)$, the probability of the set
$A = \{x : 0 < \int f(x - u)\,du < \infty\}$ is 1. [Hint: With probability 1, the integral in question is equal
to the marginal density of $Y = (Y_1, \ldots, Y_{n-1})$ where $Y_i = X_i - X_n$, and
$P[0 < g(Y) < \infty] = 1$ holds for any probability density g.]

1.21 Under the assumptions of Theorem 1.10, if there exists an equivariant estimator $\delta_0$
of $\xi$ with finite expected squared error, show that

(a) $E_0(|X_n| \mid Y) < \infty$ with probability 1;

(b) the set $B = \{x : \int |u| f(x - u)\,du < \infty\}$ has probability 1.

[Hint: (a) $E|\delta_0| < \infty$ implies $E(|\delta_0| \mid Y) < \infty$ with probability 1 and hence
$E[|\delta_0 - u(Y)| \mid Y] < \infty$ with probability 1 for any u(Y). (b) P(B) = 1 if and only if
$E(|X_n| \mid Y) < \infty$ with probability 1.]

1.22 Let $\delta_0$ be location equivariant and let U be the class of all functions u satisfying
(1.20) and such that u(X) is an unbiased estimator of zero. Then, $\delta_0$ is MRE if and only
if $\operatorname{cov}[\delta_0, u(X)] = 0$ for all $u \in U$.² (Note the analogy with Theorem 2.1.7.)


Section 2 

2.1 Show that the class G(C) is a group. 

2.2 In Example 2.2(ii), show that the transformations x' = —x together with the identity 
transformation form a group. 

2.3 Let $\{gX, g \in G\}$ be a group of transformations that leave the model (2.1) invariant.
If the distributions $P_\theta$, $\theta \in \Omega$ are distinct, show that the induced transformations $\bar g$ are
1 : 1 transformations of $\Omega$. [Hint: To show that $\bar g\theta_1 = \bar g\theta_2$ implies $\theta_1 = \theta_2$, use the fact
that $P_{\theta_1}(A) = P_{\theta_2}(A)$ for all A implies $\theta_1 = \theta_2$.]

2 Communicated by P. Bickel.

2.4 Under the assumptions of Problem 2.3, show that

(a) the transformations $\bar g$ satisfy $\overline{g_2 g_1} = \bar g_2 \cdot \bar g_1$ and $(\bar g)^{-1} = \overline{(g^{-1})}$;

(b) the transformations $\bar g$ corresponding to $g \in G$ form a group;

(c) establish (2.3) and (2.4).

2.5 Show that a loss function satisfies (2.9) if and only if it is of the form (2.10).

2.6 (a) The transformations $g^*$ defined by (2.12) satisfy $(g_2 g_1)^* = g_2^* \cdot g_1^*$ and
$(g^*)^{-1} = (g^{-1})^*$.

(b) If G is a group leaving (2.1) invariant and $G^* = \{g^*, g \in G\}$, then $G^*$ is a group.

2.7 Let X be distributed as $N(\xi, \sigma^2)$, $-\infty < \xi < \infty$, $0 < \sigma$, and let $h(\xi, \sigma) = \sigma^2$. The
problem is invariant under the transformations $x' = ax + c$; $0 < a$, $-\infty < c < \infty$. Show
that the only equivariant estimator is $\delta(X) = 0$.

2.8 Show that:

(a) If (2.11) holds, the transformations $g^*$ defined by (2.12) are 1 : 1 from $\mathcal{H}$ onto itself.

(b) If $L(\theta, d) = L(\theta, d')$ for all $\theta$ implies $d = d'$, then $g^*$ defined by (2.14) is unique,
and is a 1 : 1 transformation from $\mathcal{D}$ onto itself.

2.9 If $\theta$ is the true temperature in degrees Celsius, then $\theta' = \bar g\theta = \theta + 273$ is the true
temperature in degrees Kelvin. Given an observation X, in degrees Celsius:

(a) Show that an estimator $\delta(X)$ is functionally equivariant if it satisfies
$\delta(x) + a = \delta(x + a)$ for all a.

(b) Suppose our estimator is $\delta(x) = (ax + b\theta_0)/(a + b)$, where x is the observed
temperature in degrees Celsius, $\theta_0$ is a prior guess at the temperature, and a and b are
constants. Show that for a constant K, $\delta(x + K) \ne \delta(x) + K$, so $\delta$ does not satisfy
the principle of functional equivariance.

(c) Show that the estimators of part (b) will not satisfy the principle of formal invariance.

2.10 To illustrate the difference between functional equivariance and formal invariance, 
consider the following. 

To estimate the amount of electric power obtainable from a stream, one could use the
estimate

$\delta(x) = c \min\{100, x - 20\}$

where x = stream flow in m³/sec, 100 m³/sec is the capacity of the pipe leading to the
turbine, and 20 m³/sec is the flow reduction necessary to avoid harming the trout. The
constant c, in kilowatts/m³/sec, converts the flow to a kilowatt estimate.

(a) If measurements were, instead, made in liters and watts, so $g(x) = 1000x$ and
$\bar g(\theta) = 1000\theta$, show that functional equivariance leads to the estimate

$g(\delta(x)) = c \min\{10^5, g(x) - 20{,}000\}$.

(b) The principle of formal invariance leads to the estimate $\delta(g(x))$. Show that this
estimator is not a reasonable estimate of wattage.

(Communicated by L. LeCam.) 

2.11 In an invariant probability model, write X = (T, W), where T is sufficient for $\theta$,
and W is ancillary.

(a) If the group operation is transitive, show that any invariant statistic must be ancillary.

(b) What can you say about the invariance of an ancillary statistic?

2.12 In an invariant estimation problem, write X = (T, W) where T is sufficient for $\theta$,
and W is ancillary. If the group of transformations is transitive, show:

(a) The best equivariant estimator $\delta^*$ is the solution to $\min_d E_\theta[L(\theta, d(x)) \mid W = w]$.

(b) If e is the identity element of the group ($g^{-1}g = e$), then $\delta^* = \delta^*(t, w)$ can be found
by solving, for each w, $\min_d E_e\{L[e, d(T, w)] \mid W = w\}$.

2.13 For the situation of Example 2.11: 

(a) Show that the class of transformations is a group. 

(b) Show that equivariant estimators must satisfy $\delta(n - x) = 1 - \delta(x)$.

(c) Show that, using an invariant loss, the risk of an equivariant estimator is symmetric 
about p = 1/2. 

2.14 For the situation of Example 2.12: 

(a) Show that the class of transformations is a group. 

(b) Show that estimators of the form $\varphi(\bar x / s^2) s^2$, where $\bar x = (1/n)\sum x_i$ and $s^2 = \sum (x_i - \bar x)^2$,
are equivariant, where $\varphi$ is an arbitrary function.

(c) Show that, using an invariant loss function, the risk of an equivariant estimator is a
function only of $\tau = \mu/\sigma$.

2.15 Prove Corollary 2.13. 

2.16 (a) If g is the transformation (2.20), determine $\bar g$.

(b) In Example 2.12, show that (2.22) is not only sufficient for (2.14) but also necessary.

2.17 (a) In Example 2.12, determine the smallest group G containing both $G_1$ and $G_2$.
(b) Show that the only estimator that is invariant under G is $\delta(X, Y) = 0$.

2.18 If $\delta(X)$ is an equivariant estimator of $h(\theta)$ under a group G, then so is $g^*\delta(X)$ with
$g^*$ defined by (2.12) and (2.13), provided $G^*$ is commutative.

2.19 Show that: 

(a) In Example 2.14(i), X is not risk-unbiased. 

(b) The group of transformations $ax + c$ of the real line ($0 < a$, $-\infty < c < \infty$) is not
commutative.

2.20 In Example 2.14, determine the totality of equivariant estimators of $\Delta$ under the
smallest group G containing $G_1$ and $G_2$.

2.21 Let $\theta$ be real-valued and h strictly increasing, so that (2.11) is vacuously satisfied.
If $L(\theta, d)$ is the loss resulting from estimating $\theta$ by d, suppose that the loss resulting
from estimating $\theta' = h(\theta)$ by $d' = h(d)$ is $M(\theta', d') = L[\theta, h^{-1}(d')]$. Show that:

(a) If the problem of estimating $\theta$ with loss function L is invariant under G, then so is
the problem of estimating $h(\theta)$ with loss function M.

(b) If $\delta$ is equivariant under G for estimating $\theta$ with loss function L, show that $h[\delta(X)]$
is equivariant for estimating $h(\theta)$ with loss function M.

(c) If $\delta$ is MRE for $\theta$ with L, then $h[\delta(X)]$ is MRE for $h(\theta)$ with M.


2.22 If $\delta(X)$ is MRE for estimating $\xi$ in Example 2.2(i) with loss function $\rho(d - \xi)$, state
an optimum property of $e^{\delta(X)}$ as an estimator of $e^\xi$.

2.23 Let $X_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, s$, and W be distributed according to a density of
the form

$\left[\prod_{i=1}^{s} f_i(x_i - \xi_i)\right] h(w)$

where $x_i - \xi_i = (x_{i1} - \xi_i, \ldots, x_{in_i} - \xi_i)$, and consider the problem of estimating $\theta = \sum c_i \xi_i$
with loss function $L(\xi_1, \ldots, \xi_s; d) = \rho(d - \theta)$. Show that:

(a) This problem remains invariant under the transformations

$x'_{ij} = x_{ij} + a_i$,  $\xi'_i = \xi_i + a_i$,  $\theta' = \theta + \sum a_i c_i$,
$d' = d + \sum a_i c_i$.

(b) An estimator $\delta$ of $\theta$ is equivariant under these transformations if

$\delta(x_1 + a_1, \ldots, x_s + a_s, w) = \delta(x_1, \ldots, x_s, w) + \sum a_i c_i$.


2.24 Generalize Theorem 1.4 to the situation of Problem 2.23. 

2.25 If $\delta_0$ is any equivariant estimator of $\theta$ in Problem 2.23, and if
$y_i = (x_{i1} - x_{in_i}, x_{i2} - x_{in_i}, \ldots, x_{i,n_i-1} - x_{in_i})$, show that the most general
equivariant estimator of $\theta$ is of the form

$\delta(x_1, \ldots, x_s, w) = \delta_0(x_1, \ldots, x_s, w) - u(y_1, \ldots, y_s, w)$.

2.26 (a) Generalize Theorem 1.10 and Corollary 1.12 to the situation of Problems 2.23
and 2.25. (b) Show that the MRE estimators of (a) can be chosen to be independent of
W.

2.27 Suppose that the variables $X_{ij}$ in Problem 2.23 are independently distributed as
$N(\xi_i, \sigma^2)$, $\sigma$ is known. Show that:

(a) The MRE estimator of $\theta$ is then $\sum c_i \bar X_i - v^*$, where $\bar X_i = (X_{i1} + \cdots + X_{in_i})/n_i$, and
where $v^*$ minimizes (1.24) with $X = \sum c_i \bar X_i$.

(b) If $\rho$ is convex and even, the MRE estimator of $\theta$ is $\sum c_i \bar X_i$.

(c) The results of (a) and (b) remain valid when $\sigma$ is unknown and the distribution of
W depends on $\sigma$ (but not the $\xi$’s).

2.28 Show that the transformation of Example 2.11 and the identity transformation are 
the only transformations leaving the family of binomial distributions invariant. 


Section 3 

3.1 (a) A loss function L satisfies (3.4) if and only if it satisfies (3.5) for some $\gamma$.

(b) The sample standard deviation, the mean deviation, the range, and the MLE of $\tau$
all satisfy (3.7) with r = 1.

3.2 Show that if $\delta(X)$ is scale invariant, so is $\delta^*(X)$ defined to be $\delta(X)$ if $\delta(X) > 0$ and
= 0 otherwise, and the risk of $\delta^*$ is no larger than that of $\delta$ for any loss function (3.5) for
which $\gamma(v)$ is nonincreasing for $v < 0$.

3.3 Show that the bias of any equivariant estimator of $\tau^r$ in (3.1) is proportional to $\tau^r$.

3.4 A necessary and sufficient condition for $\delta$ to satisfy (3.7) is that it is of the form
$\delta = \delta_0/u$ with $\delta_0$ and u satisfying (3.7) and (3.9), respectively.

3.5 The function $\rho$ of Corollary 3.4 with $\gamma$ defined in Example 3.5 is strictly convex for
$p > 1$.

3.6 Let X be a positive random variable. Show that:
(a) If $EX^2 < \infty$, then the value of c that minimizes $E(X/c - 1)^2$ is $c = EX^2/EX$.

(b) If Y has the gamma distribution $\Gamma(\alpha, 1)$, then the value of w minimizing
$E[(Y/w) - 1]^2$ is $w = \alpha + 1$.

3.7 Let X be a positive random variable.

(a) If $EX < \infty$, then the value of c that minimizes $E|X/c - 1|$ is a solution to
$EXI(X < c) = EXI(X > c)$, which is known as a scale median.

(b) Let Y have a $\chi^2$-distribution with f degrees of freedom. Then, the minimizing
value is $w = f + 2$. [Hint: (b) Example 1.5.9.]

3.8 Under the assumptions of Problem 3.7(a), the set of scale medians of X is an interval.
If $f(x) > 0$ for all $x > 0$, the scale median of X is unique.

3.9 Determine the scale median of X when the distribution of X is (a) $U(0, \theta)$ and (b)
$E(0, b)$.

3.10 Under the assumptions of Theorem 3.3: 

(a) Show that the MRE estimator under the loss (3.13) is given by (3.14).

(b) Show that the MRE estimator under the loss (3.15) is given by (3.11), where $w^*(z)$
is any scale median of $\delta_0(x)$ under the distribution of X|Z.

[Hint: Problem 3.7.]

3.11 Let $X_1, \ldots, X_n$ be iid according to the uniform distribution $U(0, \theta)$.

(a) Show that the complete sufficient statistic $X_{(n)}$ is independent of Z [given by
Equation (3.8)].

(b) For the loss function (3.13) with r = 1, the MRE estimator of $\theta$ is $X_{(n)}/w$, with
$w = (n + 1)/(n + 2)$.

(c) For the loss function (3.15) with r = 1, the MRE estimator of $\theta$ is $[2^{1/(n+1)}] X_{(n)}$.

3.12 Show that the MRE estimators of Problem 3.11, parts (b) and (c), are risk-unbiased,
but not mean-unbiased.

3.13 In Example 3.7, find the MRE estimator of $\operatorname{var}(X_1)$ when the loss function is (a)
(3.13) and (b) (3.15) with r = 2.

3.14 Let $X_1, \ldots, X_n$ be iid according to the exponential distribution $E(0, \tau)$. Determine
the MRE estimator of $\tau$ for the loss functions (a) (3.13) and (b) (3.15) with r = 1.

3.15 In the preceding problem, find the MRE estimator of $\operatorname{var}(X_1)$ when the loss function
is (3.13) with r = 2.

3.16 Prove formula (3.19).

3.17 Let $X_1, \ldots, X_n$ be iid each with density $(2/\tau)[1 - (x/\tau)]$, $0 < x < \tau$. Determine
the MRE estimator (3.19) of $\tau^r$ when (a) n = 2, (b) n = 3, and (c) n = 4.

3.18 In the preceding problem, find $\operatorname{var}(X_1)$ and its MRE estimator for n = 2, 3, 4 when
the loss function is (3.13) with r = 2.

3.19 (a) Show that the loss function $L_s$ of (3.20) is convex and invariant under scale
transformations.

(b) Prove Corollary 3.8.

(c) Show that for the situation of Example 3.7, if the loss function is $L_s$, then the
UMVU estimator is also the MRE.

3.20 Let X\, ..., X n be iid from the distribution N(9, 9 2 ). 

(a) Show that this probability model is closed under scale transformations. 
(b) Show that the MLE is equivariant. 

[The MRE estimator is obtainable from Theorem 3.3, but does not have a simple form. 
See Eaton 1989, Robert 1991, 1994a for more details. Gleser and Healy (1976) consider 
a similar problem using squared error loss.] 

3.21 (a) If $\delta_0$ satisfies (3.7) and $c\delta_0$ satisfies (3.22), show that $c\delta_0$ cannot be unbiased 
in the sense of satisfying $E(c\delta_0) = \tau^r$. 

(b) Prove the statement made in Example 3.10. 

3.22 Verify the estimator $\delta^*$ of Example 3.12. 

3.23 If G is a group, a subset $G_0$ of G is a subgroup of G if $G_0$ is a group under the group 
operation of G. 

(a) Show that the scale group (3.32) is a subgroup of the location-scale group (3.24). 

(b) Show that any estimator of $\tau^r$ that is equivariant under (3.24) is also 
equivariant under (3.32); hence, in a problem that is equivariant under (3.32), the 
best scale equivariant estimator is at least as good as the best location-scale 
equivariant estimator. 

(c) Explain why, in general, if $G_0$ is a subgroup of G, one can expect equivariance under 
$G_0$ to produce better estimators than equivariance under G. 

3.24 For the situation of Example 3.13: 

(a) Show that an estimator is equivariant if and only if it can be written in the form 
$\varphi(\bar{x}/s)s^2$. 

(b) Show that the risk of an equivariant estimator is a function only of $\xi/\tau$. 

3.25 If $X_1, \ldots, X_n$ are iid according to $E(\xi, \tau)$, determine the MRE estimator of $\tau$ for 
the loss functions (a) (3.13) and (b) (3.15) with r = 1 and the MRE estimator of $\xi$ for 
the loss function (3.43). 

3.26 Show that $\delta$ satisfies (3.35) if and only if it satisfies (3.40) and (3.41). 

3.27 Determine the bias of the estimator $\delta^*(X)$ of Example 3.18. 

3.28 Lele (1993) uses invariance in the study of morphometrics, the quantitative analysis of 
biological forms. In the analysis of a biological object, one measures data X on k specific 
points called landmarks, where each landmark is typically two- or three-dimensional. 
Here we will assume that the landmark is two-dimensional (as is a picture), so X is a 
k × 2 matrix. A model for X is 

$$X = (M + Y)\Gamma + t$$

where $M_{k \times 2}$ is the mean form of the object, t is a fixed translation vector, and $\Gamma$ is a 2 × 2 
matrix that rotates the vector X. The random variable $Y_{k \times 2}$ is a matrix normal random 
variable, that is, each column of Y is distributed as $N(0, \Sigma_k)$, a k-variate normal random 
variable, and each row is distributed as $N(0, \Sigma_d)$, a bivariate normal random variable. 

(a) Show that X is a matrix normal random variable with columns distributed as 
$N_k(M\Gamma_j, \Sigma_k)$ and rows distributed as $N_2(M_i\Gamma, \Gamma'\Sigma_d\Gamma)$, where $\Gamma_j$ is the jth column 
of $\Gamma$ and $M_i$ is the ith row of M. 

(b) For estimation of the shape of a biological form, the parameters of interest are M, 
$\Sigma_k$, and $\Sigma_d$, with t and $\Gamma$ being nuisance parameters. Show that, even if there were 
no nuisance parameters, $\Sigma_k$ or $\Sigma_d$ is not identifiable. 

(c) It is usually assumed that the (1, 1) element of either $\Sigma_k$ or $\Sigma_d$ is equal to 1. Show 
that this makes the model identifiable. 





(d) The form of a biological object is considered an inherent property of the form 
(a baby has the same form as an adult) and should not be affected by rotations, 
reflections, or translations. This is summarized by the transformation 

$$X' = XP + b$$

where P is a 2 × 2 orthogonal matrix (P'P = I) and b is a k × 1 vector. (See Note 
9.3 for a similar group.) Suppose we observe n landmarks $X_1, \ldots, X_n$. Define the 
Euclidean distance between two matrices A and B to be $D(A, B) = \sum_{ij}(a_{ij} - b_{ij})^2$, 
and let the n × n matrix F have (i, j)th element $f_{ij} = D(X_i, X_j)$. Show that F is 
invariant under this group, that is, $F(X') = F(X)$. (Lele (1993) notes that F is, in 
fact, maximal invariant.) 
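
The invariance claimed in part (d) is easy to probe numerically. The following sketch is an illustration under stated assumptions rather than part of the problem: each $X_i$ is taken to be a k × 2 matrix, the translation is read as one common shift added to every row, and the helper names are hypothetical. It builds F before and after applying an orthogonal P (here a reflection composed with a rotation) and a shift, and reports the largest entrywise change.

```python
import math
import random

def transform(X, P, b):
    """Apply X -> X P + b row by row; X is k x 2, P is 2 x 2, and b is a length-2
    shift added to every row (an assumed reading of the translation term)."""
    return [[row[0] * P[0][j] + row[1] * P[1][j] + b[j] for j in range(2)] for row in X]

def dist(A, B):
    """D(A, B) = sum over all entries of (a_ij - b_ij)^2."""
    return sum((a - c) ** 2 for ra, rb in zip(A, B) for a, c in zip(ra, rb))

def distance_matrix(forms):
    """Matrix F with (i, j)th element D(X_i, X_j)."""
    return [[dist(Xi, Xj) for Xj in forms] for Xi in forms]

rng = random.Random(1)
k, n = 4, 3
forms = [[[rng.gauss(0, 1) for _ in range(2)] for _ in range(k)] for _ in range(n)]

angle = 0.7
P = [[math.cos(angle), math.sin(angle)],      # orthogonal with determinant -1,
     [math.sin(angle), -math.cos(angle)]]     # so it includes a reflection
b = [2.5, -1.0]

F = distance_matrix(forms)
F_new = distance_matrix([transform(X, P, b) for X in forms])
max_diff = max(abs(F[i][j] - F_new[i][j]) for i in range(n) for j in range(n))
print(max_diff)   # differs from zero only by floating-point rounding
```

The shift cancels in every difference $X_i - X_j$, and P preserves squared lengths, which is the argument the problem asks for.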

3.29 In (9.1), show that the group $X' = AX + b$ induces the group $\mu' = A\mu + b$, $\Sigma' = A\Sigma A'$. 

3.30 For the situation of Note 9.3, consider the equivariant estimation of $\mu$. 

(a) Show that an invariant loss is of the form $L(\mu, \Sigma, \delta) = L((\mu - \delta)'\Sigma^{-1}(\mu - \delta))$. 

(b) The equivariant estimators are of the form $\bar{X} + c$, with c = 0 yielding the MRE 
estimator. 

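3.31 For $X_1, \ldots, X_n$ iid as $N_p(\mu, \Sigma)$, the cross-products matrix S is defined by 

$$S = \{S_{ij}\} = \sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$$

where $\bar{x}_i = (1/n)\sum_{k=1}^{n} x_{ik}$. Show that, for $\Sigma = I$, 

(a) $E_I[\mathrm{tr}\,S] = E_I \sum_{i=1}^{p}\sum_{k=1}^{n} (X_{ik} - \bar{X}_i)(X_{ik} - \bar{X}_i) = p(n - 1)$, 

(b) $E_I[\mathrm{tr}\,S^2] = E_I \sum_{i=1}^{p}\sum_{j=1}^{p} \{\sum_{k=1}^{n} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)\}^2 = (n - 1)(np - p - 1)$. 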

[These are straightforward, although somewhat tedious, calculations involving the chi- 
squared distribution. Alternatively, one can use the fact that S has a Wishart distribution 
(see, for example, Anderson 1984), and use the properties of that distribution.] 
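
Before carrying out the calculation, the two expectations can be estimated by simulation. The sketch below is a Monte Carlo check only, with hypothetical helper names. The closed form $p(n-1)(n+p)$ printed next to the estimate of $E_I[\mathrm{tr}\,S^2]$ is not taken from the text: it is derived here from $E(\chi^2_m)^2 = m(m+2)$ for the diagonal entries and $E S_{ij}^2 = m$ for the off-diagonal entries, with $m = n - 1$. It differs from the expression stated in part (b), so both can be compared with the simulated value.

```python
import random

def cross_products(sample, p):
    """S_ij = sum_k (x_ik - xbar_i)(x_jk - xbar_j) for a list of n observations of length p."""
    n = len(sample)
    means = [sum(x[i] for x in sample) / n for i in range(p)]
    return [[sum((x[i] - means[i]) * (x[j] - means[j]) for x in sample)
             for j in range(p)] for i in range(p)]

def trace_moments(n, p, reps, rng):
    """Monte Carlo estimates of E[tr S] and E[tr S^2] when Sigma = I."""
    tr1 = tr2 = 0.0
    for _ in range(reps):
        sample = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        S = cross_products(sample, p)
        tr1 += sum(S[i][i] for i in range(p))
        # S is symmetric, so tr(S^2) equals the sum of all squared entries.
        tr2 += sum(S[i][j] ** 2 for i in range(p) for j in range(p))
    return tr1 / reps, tr2 / reps

n, p = 6, 2
est_tr, est_tr2 = trace_moments(n, p, 20_000, random.Random(2))
print("E tr S   :", est_tr, "vs", p * (n - 1))
print("E tr S^2 :", est_tr2, "vs", p * (n - 1) * (n + p))
```

For p = 1 the comparison value reduces to $(n-1)(n+1)$, which gives $c = 1/(n+1)$ in the ratio $E_I \mathrm{tr}\,S / E_I \mathrm{tr}\,S^2$, the familiar scalar result.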

3.32 For the situation of Note 9.3: 

(a) Show that equivariant estimators of $\Sigma$ are of the form cS, where S is the 
cross-products matrix and c is a constant. 

(b) Show that $E_I\{\mathrm{tr}[(cS - I)'(cS - I)]\}$ is minimized by $c = E_I \mathrm{tr}\,S / E_I \mathrm{tr}\,S^2$. 

[Hint: For part (a), use a generalization of Theorem 3.3; see the argument leading to 
(3.29), and Example 3.11.] 

3.33 For the estimation of $\Sigma$ in Note 9.3: 

(a) Show that the loss function in (9.2) is invariant. 

(b) Show that Stein's loss $L(\delta, \Sigma) = \mathrm{tr}(\delta\Sigma^{-1}) - \log|\delta\Sigma^{-1}| - p$, where |A| is the 
determinant of A, is an invariant loss with MRE estimator S/n. 

(c) Show that a loss $L(\delta, \Sigma)$ is an invariant loss if and only if it can be written as a 
function of the eigenvalues of $\delta\Sigma^{-1}$. 

[The univariate version of Stein's loss was seen in (3.20) and Example 3.9. Stein (1956b) 
and James and Stein (1961) used the multivariate version of the loss. See also Dey and 
Srinivasan 1985, and Dey et al. 1987.] 
3.34 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ have joint density 

$$\frac{1}{\sigma^m \tau^n} f\left(\frac{x_1}{\sigma}, \ldots, \frac{x_m}{\sigma};\ \frac{y_1}{\tau}, \ldots, \frac{y_n}{\tau}\right)$$

and consider the problem of estimating $\theta = (\tau/\sigma)^r$ with loss function $L(\sigma, \tau; d) = 
\gamma(d/\theta)$. This problem remains invariant under the transformations $X_i' = aX_i$, $Y_j' = bY_j$, 
$\sigma' = a\sigma$, $\tau' = b\tau$, and $d' = (b/a)^r d$ (a, b > 0), and an estimator $\delta$ is equivariant under 
these transformations if $\delta(ax, by) = (b/a)^r \delta(x, y)$. Generalize Theorems 3.1 and 3.3, 
Corollary 3.4, and (3.19) to the present situation. 

3.35 Under the assumptions of the preceding problem and with loss function $(d - \theta)^2/\theta^2$, 
determine the MRE estimator of $\theta$ in the following situations: 

(a) m = n = 1 and X and Y are independently distributed as $\Gamma(\alpha, \sigma^2)$ and $\Gamma(\beta, \tau^2)$, 
respectively ($\alpha$, $\beta$ known). 

(b) $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independently distributed as $N(0, \sigma^2)$ and $N(0, \tau^2)$, 
respectively. 

(c) $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independently distributed as $U(0, \sigma)$ and $U(0, \tau)$, 
respectively. 


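3.36 Generalize the results of Problem 3.34 to the case that the joint density of X and Y 
is 

$$\frac{1}{\sigma^m \tau^n} f\left(\frac{x_1 - \xi}{\sigma}, \ldots, \frac{x_m - \xi}{\sigma};\ \frac{y_1 - \eta}{\tau}, \ldots, \frac{y_n - \eta}{\tau}\right).$$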

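3.37 Obtain the MRE estimator of $\theta = (\tau/\sigma)^r$ with the loss function of Problem 3.35 
when the density of Problem 3.36 specializes to 

$$\frac{1}{\sigma^m \tau^n} \prod_i f\left(\frac{x_i - \xi}{\sigma}\right) \prod_j f\left(\frac{y_j - \eta}{\tau}\right)$$

and f is (a) normal, (b) exponential, or (c) uniform. 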

3.38 In the model of Problem 3.37 with $\tau = \sigma$, discuss the equivariant estimation of 
$\Delta = \eta - \xi$ with loss function $(d - \Delta)^2/\sigma^2$ and obtain explicit results for the three 
distributions of that problem. 

3.39 Suppose in Problem 3.37 that an MRE estimator $\delta^*$ of $\Delta = \eta - \xi$ under the 
transformations $X_i' = a + bX_i$ and $Y_j' = a + bY_j$, b > 0, exists when the ratio $\tau/\sigma = c$ is 
known and that $\delta^*$ is independent of c. Show that $\delta^*$ is MRE also when $\sigma$ and $\tau$ are 
completely unknown despite the fact that the induced group of transformations of the 
parameter space is not transitive. 

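3.40 Let $f(t) = \frac{1}{\pi}\frac{1}{1 + t^2}$ be the Cauchy density, and consider the location-scale family 

$$\mathcal{F} = \left\{ \frac{1}{\sigma} f\left(\frac{x - \mu}{\sigma}\right),\ -\infty < \mu < \infty,\ 0 < \sigma < \infty \right\}.$$

(a) Show that this probability model is invariant under the transformation $x' = 1/x$. 

(b) If $\mu' = \mu/(\mu^2 + \sigma^2)$ and $\sigma' = \sigma/(\mu^2 + \sigma^2)$, show that $P_{\mu,\sigma}(X' \in A) = P_{\mu',\sigma'}(X \in A)$; 
that is, if X has the Cauchy density with location parameter $\mu$ and scale parameter 
$\sigma$, then X' has the Cauchy density with location parameter $\mu/(\mu^2 + \sigma^2)$ and scale 
parameter $\sigma/(\mu^2 + \sigma^2)$. 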

(c) Explain why this group of transformations of the sample and parameter spaces does 
not lead to an invariant estimation problem. 
[See McCullagh (1992) for a full development of this model, where it is suggested that 
the complex plane provides a more appropriate parameter space.] 
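
The parameter map in part (b) can be checked by simulation. The sketch below is illustrative and its helper names are hypothetical: it draws from a Cauchy distribution by inverting the cdf, takes reciprocals, and compares the sample median and half the interquartile range of 1/X with the location and scale stated in part (b). The comparison relies on the fact that, for a Cauchy law, the median equals the location parameter and half the interquartile range equals the scale parameter.

```python
import math
import random

def cauchy_sample(mu, sigma, size, rng):
    """Draw from the Cauchy location-scale family by inverting the cdf."""
    return [mu + sigma * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(size)]

def quartiles(values):
    """Return the empirical 25%, 50%, and 75% points."""
    v = sorted(values)
    m = len(v)
    return v[m // 4], v[m // 2], v[(3 * m) // 4]

mu, sigma = 1.0, 2.0
rng = random.Random(3)
recip = [1.0 / x for x in cauchy_sample(mu, sigma, 200_000, rng)]

q1, med, q3 = quartiles(recip)
mu_new = mu / (mu ** 2 + sigma ** 2)        # location stated in part (b)
sigma_new = sigma / (mu ** 2 + sigma ** 2)  # scale stated in part (b)

print("location:", med, "vs", mu_new)
print("scale   :", (q3 - q1) / 2, "vs", sigma_new)
```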

3.41 Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be distributed as independent bivariate normal random 
variables with mean $(\mu, 0)$ and covariance matrix 

$$\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}.$$

(a) Show that the probability model is invariant under the transformations 

$$(x', y') = (a + bx, by),$$

$$(\mu', \sigma_{11}', \sigma_{12}', \sigma_{22}') = (a + b\mu, b^2\sigma_{11}, b^2\sigma_{12}, b^2\sigma_{22}).$$

(b) Using the loss function $L(\mu, d) = (\mu - d)^2/\sigma_{11}$, show that this is an invariant 
estimation problem, and equivariant estimators must be of the form $\delta = \bar{x} + \psi(u_1, u_2, u_3)\bar{y}$, 
where $u_1 = \sum(x_i - \bar{x})^2/\bar{y}^2$, $u_2 = \sum(y_i - \bar{y})^2/\bar{y}^2$, and $u_3 = \sum(x_i - \bar{x})(y_i - \bar{y})/\bar{y}^2$. 

(c) Show that if $\delta$ has a finite second moment, then it is unbiased for estimating $\mu$. Its 
risk function is a function of $\sigma_{11}/\sigma_{22}$ and $\sigma_{12}/\sigma_{22}$. 

(d) If the ratio $\sigma_{12}/\sigma_{22}$ is known, show that $\bar{X} - (\sigma_{12}/\sigma_{22})\bar{Y}$ is the MRE estimator of $\mu$. 

[This problem illustrates the technique of covariance adjustment. See Berry, 1987.] 

3.42 Suppose we let $X_1, \ldots, X_n$ be a sample from an exponential distribution $f(x \mid \mu, \sigma) = 
(1/\sigma)e^{-(x - \mu)/\sigma} I(x \ge \mu)$. The exponential distribution is useful in reliability theory, and 
a parameter of interest is often a quantile, that is, a parameter of the form $\mu + b\sigma$, 
where b is known. Show that, under quadratic loss, the MRE estimator of $\mu + b\sigma$ is 
$\delta_0 = x_{(1)} + (b - 1/n)(\bar{x} - x_{(1)})$, where $x_{(1)} = \min_i x_i$. 

[Rukhin and Strawderman (1982) show that $\delta_0$ is inadmissible, and exhibit a class of 
improved estimators.] 


Section 4 

4.1 (a) Suppose $X_i : N(\xi_i, \sigma^2)$ with $\xi_i = \alpha + \beta t_i$. If the first column of the matrix C 
leading to the canonical form (4.7) is $(1/\sqrt{n}, \ldots, 1/\sqrt{n})'$, find the second column 
of C. 

(b) If $X_i : N(\xi_i, \sigma^2)$ with $\xi_i = \alpha + \beta t_i + \gamma t_i^2$, and the first two columns of C are those 
of (a), find the third column under the simplifying assumptions $\sum t_i = 0$, $\sum t_i^2 = 1$. 
[Note: The orthogonal polynomials that are progressively built up in this way are 
frequently used to simplify regression analysis.] 

4.2 Write out explicit expressions for the transformations (4.10) when $\Pi_\Omega$ is given by 

(a) $\xi_i = \alpha + \beta t_i$ and (b) $\xi_i = \alpha + \beta t_i + \gamma t_i^2$. 

4.3 Use Problem 3.10 to prove (iii) of Theorem 4.3. 

4.4 (a) In Example 4.7, determine $\hat\alpha$, $\hat\beta$, and hence $\hat\xi_i$ by minimizing $\sum(X_i - \alpha - \beta t_i)^2$. 

(b) Verify the expressions (4.12) for $\alpha$ and $\beta$, and the corresponding expressions for $\hat\alpha$ 
and $\hat\beta$. 

4.5 In Example 4.2, find the UMVU estimators of $\alpha$, $\beta$, $\gamma$, and $\sigma^2$ when $\sum t_i = 0$ and 
$\sum t_i^2 = 1$. 

4.6 Let $X_{ij}$ be independent $N(\xi_{ij}, \sigma^2)$ with $\xi_{ij} = \alpha_i + \beta t_{ij}$. Find the UMVU estimators 
of the $\alpha_i$ and $\beta$. 

4.7 (a) In Example 4.9, show that the vectors of the coefficients in the $\hat\alpha_i$ are not 
orthogonal to the vector of the coefficients of $\hat\mu$. 
(b) Show that the conclusion of (a) is reversed if a, and jl are replaced by ctj and jl. 

4.8 In Example 4.9, find the UMVU estimator of $\mu$ when the $\alpha_i$ are known to be zero 
and compare it with $\hat\mu$. 

4.9 The coefficient vectors of the $X_{ijk}$ given by (4.32) for $\hat\mu$, $\hat\alpha_i$, and $\hat\beta_j$ are orthogonal 
to the coefficient vectors for the $\hat\gamma_{ij}$ given by (4.33). 

4.10 In the model defined by (4.26) and (4.27), determine the UMVU estimators of $\alpha_i$, 
$\beta_j$, and $\sigma^2$ under the assumption that the $\gamma_{ij}$ are known to be zero. 

4.11 (a) In Example 4.11, show that 

$$\sum\sum\sum (X_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 = S^2 + S_\mu^2 + S_\alpha^2 + S_\beta^2 + S_\gamma^2$$

where $S^2 = \sum\sum\sum (X_{ijk} - X_{ij\cdot})^2$, $S_\mu^2 = IJm(X_{\cdots} - \mu)^2$, $S_\alpha^2 = Jm\sum(X_{i\cdot\cdot} - X_{\cdots} - \alpha_i)^2$, 
and $S_\beta^2$, $S_\gamma^2$ are defined analogously. 

(b) Use the decomposition of (a) to show that the least squares estimators of $\mu$, $\alpha_i$, ... 
are given by (4.32) and (4.33). 

(c) Show that the error sum of squares $S^2$ is equal to $\sum\sum\sum (X_{ijk} - \hat\xi_{ij})^2$ and hence in 
the canonical form to $\sum_{j=s+1}^{n} Y_j^2$. 

4.12 (a) Show how the decomposition in Problem 4.11(a) must be modified when it is 
known that the $\gamma_{ij}$ are zero. 

(b) Use the decomposition of (a) to solve Problem 4.10. 

4.13 Let $X_{ijk}$ ($i = 1, \ldots, I$, $j = 1, \ldots, J$, $k = 1, \ldots, K$) be $N(\xi_{ijk}, \sigma^2)$ with 

$$\xi_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k$$

where $\sum\alpha_i = \sum\beta_j = \sum\gamma_k = 0$. Express $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_k$ in terms of the $\xi$'s and find 
their UMVU estimators. Viewed as a special case of (4.4), what is the value of s? 

4.14 Extend the results of the preceding problem to the model 

$$\xi_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_{ij} + \varepsilon_{ik} + \lambda_{jk}$$

where 

$$\sum_i \delta_{ij} = \sum_j \delta_{ij} = \sum_i \varepsilon_{ik} = \sum_k \varepsilon_{ik} = \sum_j \lambda_{jk} = \sum_k \lambda_{jk} = 0.$$

4.15 In the preceding problem, if it is known that the $\lambda$'s are zero, determine whether the 
UMVU estimators of the remaining parameters remain unchanged. 

4.16 (a) Show that under assumptions (4.35), if $\xi = \theta A$, then the least squares estimate 
of $\theta$ is $xA'(AA')^{-1}$. 

(b) If (X, A) is multivariate normal with all parameters unknown, show that the least 
squares estimator of part (a) is a function of the complete sufficient statistic and, 
hence, prove part (a) of Theorem 4.14. 

4.17 A generalization of the order statistics, to vectors, is given by the following defini¬ 
tion. 

Definition 8.1 The $c_j$-order statistics of a sample of vectors are the vectors arranged 
in increasing order according to their jth components. 

Let $X_i$, $i = 1, \ldots, n$, be an iid sample of p × 1 vectors, and let $X = (X_1, \ldots, X_n)$ be a 
p × n matrix. 

(a) If the distribution of $X_i$ is completely unknown, show that, for any j, $j = 1, \ldots, p$, 
the $c_j$-order statistics of $(X_1, \ldots, X_n)$ are complete sufficient. (That is, the vectors 
$X_1, \ldots, X_n$ are ordered according to their jth coordinate.) 
(b) Let $Y_{1 \times n}$ be a random variable with unknown distribution (possibly different from 
$X_i$). Form the (p + 1) × n matrix $\binom{X}{Y}$, and for any $j = 1, \ldots, p$, calculate the 
$c_j$-order statistics based on the columns of $\binom{X}{Y}$. Show that these $c_j$-order statistics 
are sufficient. 

[Hint: See Problem 1.6.33, and also TSH2, Chapter 4, Problem 12.] 

(c) Use parts (a) and (b) to prove Theorem 4.14(b). 

[Hint: Part (b) implies that only a symmetric function of (X, A) need be considered, and 
part (a) implies that an unconditionally unbiased estimator must also be conditionally 
unbiased. Theorem 4.12 then applies.] 

4.18 The proof of Theorem 4.14(c) is based on two results. Establish that: 


(a) For large values of $\theta$, the unconditional variance of a linear unbiased estimator will 
be greater than that of the least squares estimator. 

(b) For $\theta = 0$, the variance of $XA'(AA')^{-1}$ is greater than that of $XA'[E(AA')]^{-1}$. [You 
may use the fact that $E(AA')^{-1} - [E(AA')]^{-1}$ is a positive definite matrix (Marshall 
and Olkin 1979; Shaffer 1991). This is a multivariate extension of Jensen's 
inequality.] 

(c) Parts (a) and (b) imply that no best linear unbiased estimator of $\sum\gamma_i\xi_i$ exists if 
EAA' is known. 

4.19 (a) Under the assumptions of Example 4.15, find the variance of E7. ; S?. 

(b) Show that the variance of (a) is minimized by the values stated in the example. 

4.20 In the linear model (4.4), a function $\sum c_i\xi_i$ with $\sum c_i = 0$ is called a contrast. Show 
that a linear function $\sum d_i\xi_i$ is a contrast if and only if it is translation invariant, that is, 
satisfies $\sum d_i(\xi_i + a) = \sum d_i\xi_i$ for all a, and hence if and only if it is a function of the 
differences $\xi_i - \xi_j$. 

4.21 Determine which of the following are contrasts: 

(a) The regression coefficients $\alpha$, $\beta$, or $\gamma$ of (4.2). 

(b) The parameters $\mu$, $\alpha_i$, $\beta_j$, or $\gamma_{ij}$ of (4.27). 

(c) The parameters $\mu$ or $\alpha_i$ of (4.23) and (4.24). 


Section 5 

5.1 In Example 5.1: 


(a) Show that the joint density of the $Z_{ij}$ is given by (5.2). 

(b) Obtain the joint multivariate normal density of the $X_{ij}$ directly by evaluating their 
covariance matrix and then inverting it. 

[Hint: The covariance matrix of $X_{11}, \ldots, X_{1n}; \ldots; X_{s1}, \ldots, X_{sn}$ has the form 

$$\Sigma = \begin{pmatrix} \Sigma_1 & 0 & \cdots & 0 \\ 0 & \Sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_s \end{pmatrix}$$

where each $\Sigma_i$ is an n × n matrix with a value $a_i$ for all diagonal elements and a 
value $b_i$ for all off-diagonal elements. For the inversion of $\Sigma_i$, see the next problem.] 
5.2 Let $A = (a_{ij})$ be a nonsingular n × n matrix with $a_{ii} = a$ and $a_{ij} = b$ for all $i \ne j$. 
Determine the elements of $A^{-1}$. [Hint: Assume that $A^{-1} = (c_{ij})$ with $c_{ii} = c$ and $c_{ij} = d$ 
for all $i \ne j$, calculate c and d as the solutions of the two linear equations $\sum_j a_{1j}c_{j1} = 1$ 
and $\sum_j a_{1j}c_{j2} = 0$, and check the product AC.] 
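
The approach in the hint can be tried out numerically. In the sketch below, which is an illustration with hypothetical function names, the two linear equations are written out for the patterned matrix as $ac + (n-1)bd = 1$ and $bc + [a + (n-2)b]d = 0$ and solved in closed form, and the product AC is then formed to confirm that it is the identity.

```python
def patterned(n, diag, off):
    """n x n matrix with `diag` on the diagonal and `off` everywhere else."""
    return [[diag if i == j else off for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two square matrices of the same size."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def inverse_entries(n, a, b):
    """Solve  a*c + (n-1)*b*d = 1  and  b*c + (a + (n-2)*b)*d = 0
    for the diagonal entry c and the off-diagonal entry d of the inverse."""
    denom = (a - b) * (a + (n - 1) * b)   # nonzero exactly when A is nonsingular
    return (a + (n - 2) * b) / denom, -b / denom

n, a, b = 4, 3.0, 0.5
c, d = inverse_entries(n, a, b)
product = matmul(patterned(n, a, b), patterned(n, c, d))
for row in product:
    print([round(v, 12) for v in row])
```

The denominator shows when the ansatz fails: the matrix is singular if a = b or if a = -(n - 1)b.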

5.3 Verify the UMVU estimator of $\sigma_A^2/\sigma^2$ given in Example 5.1. 

5.4 Obtain the joint density of the $X_{ij}$ in Example 5.1 in the unbalanced case in which 
$j = 1, \ldots, n_i$, with the $n_i$ not all equal, and determine a minimal set of sufficient statistics 
(which depends on the number of distinct values of $n_i$). 

5.5 In the balanced one-way layout of Example 5.1, determine $\lim P(\hat\sigma_A^2 < 0)$ as $n \to \infty$ 
for $\sigma_A^2/\sigma^2$ = 0, 0.2, 0.5, 1, and s = 3, 4, 5, 6. [Hint: The limit of the probability can be 
expressed as a probability for a $\chi^2_{s-1}$ variable.] 

5.6 In the preceding problem, calculate values of $P(\hat\sigma_A^2 < 0)$ for finite n. When would you 
expect negative estimates to be a problem? [The probability $P(\hat\sigma_A^2 < 0)$, which involves 
an F random variable, can also be expressed using the incomplete beta function, whose 
values are readily available through either extensive tables or computer packages. Searle 
et al. (1992, Section 3.5d) look at this problem in some detail.] 
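
When tables or an incomplete beta routine are not at hand, the probability can be approximated by simulation. The sketch below rests on an assumption not spelled out in this problem: that the estimator of $\sigma_A^2$ is the usual analysis-of-variance one, proportional to the between-group mean square minus the within-group mean square, with the between-group mean square distributed as $(\sigma^2 + n\sigma_A^2)\chi^2_{s-1}/(s-1)$ and the within-group mean square as $\sigma^2\chi^2_{s(n-1)}/[s(n-1)]$, independently. The function name is hypothetical.

```python
import random

def prob_negative(s, n, ratio, reps, rng):
    """Monte Carlo estimate of P(estimate of sigma_A^2 < 0) in the balanced one-way
    random effects layout with s groups, n observations per group, and
    ratio = sigma_A^2 / sigma^2."""
    count = 0
    for _ in range(reps):
        # Between-group mean square in units of sigma^2; chi2_k is Gamma(k/2, scale 2).
        between = (1 + n * ratio) * rng.gammavariate((s - 1) / 2, 2.0) / (s - 1)
        # Within-group mean square in units of sigma^2.
        within = rng.gammavariate(s * (n - 1) / 2, 2.0) / (s * (n - 1))
        if between < within:
            count += 1
    return count / reps

rng = random.Random(4)
for ratio in (0.0, 0.2, 0.5, 1.0):
    print(ratio, prob_negative(3, 5, ratio, 50_000, rng))
```

With s = 3 the numerator has two degrees of freedom, and the probability then has the closed form $1 - (1 + 2x/d)^{-d/2}$ with $x = 1/(1 + n\sigma_A^2/\sigma^2)$ and $d = s(n-1)$, which gives a convenient check of the simulation.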

5.7 The following problem shows that in Examples 5.1-5.3 every unbiased estimator of 
the variance components (except $\sigma^2$) takes on negative values. (For some related results, 
see Pukelsheim 1981.) 

Let X have distribution $P \in \mathcal{P}$ and suppose that T is a complete sufficient statistic for $\mathcal{P}$. 
If g(P) is any U-estimable function defined over $\mathcal{P}$ and its UMVU estimator $\eta(T)$ takes 
on negative values with probability > 0, then show that this is true of every unbiased 
estimator of g(P). [Hint: For any unbiased estimator $\delta$, recall that $E(\delta \mid T) = \eta(T)$.] 

5.8 Modify the car illustration of Example 5.1 so that it illustrates (5.5). 

5.9 In Example 5.2, define a linear transformation of the $X_{ijk}$ leading to the joint 
distribution of the $Z_{ijk}$ stated in connection with (5.6), and verify the complete sufficient 
statistics (5.7). 

5.10 In Example 5.2, obtain the UMVU estimators of the variance components $\sigma_A^2$, $\sigma_B^2$, 
and $\sigma^2$ when $\sigma_C^2 = 0$, and compare them to those obtained without this assumption. 

5.11 For the $X_{ijk}$ given in (5.8), determine a transformation taking them to variables $Z_{ijk}$ 
with the distribution stated in Example 5.3. 

5.12 In Example 5.3, obtain the UMVU estimators of the variance components $\sigma_A^2$, $\sigma_B^2$, 
and $\sigma^2$. 

5.13 In Example 5.3, obtain the UMVU estimators of $\sigma_A^2$ and $\sigma^2$ when $\sigma_B^2 = 0$ so that the 
B terms in (5.8) drop out, and compare them with those of Problem 5.12. 

5.14 In Example 5.4: 

(a) Give a transformation taking the variables $X_{ijk}$ into the $W_{ijk}$ with density (5.11). 

(b) Obtain the UMVU estimators of $\mu$, $\alpha_i$, $\sigma_B^2$, and $\sigma^2$. 

5.15 A general class of models containing linear models of Types I and II, and mixed 
models as special cases assumes that the 1 × n observation vector X is normally 
distributed with mean $\theta A$ as in (4.13) and with covariance matrix $\sum\gamma_i V_i$ where the $\gamma$'s 
are the components of variance and the $V_i$'s are known symmetric positive semidefinite 
n × n matrices. Show that the following models are of this type and in each case specify 
the $\gamma$'s and V's: (a) (5.1); (b) (5.5); (c) (5.5) without the terms $C_{ij}$; (d) (5.8); (e) (5.10). 
5.16 Consider a nested three-way layout with 

$$X_{ijkl} = \mu + a_i + b_{ij} + c_{ijk} + U_{ijkl}$$

($i = 1, \ldots, I$; $j = 1, \ldots, J$; $k = 1, \ldots, K$; $l = 1, \ldots, n$) in the versions 

(a) $a_i = \alpha_i$, $b_{ij} = \beta_{ij}$, $c_{ijk} = \gamma_{ijk}$; 

(b) $a_i = \alpha_i$, $b_{ij} = \beta_{ij}$, $c_{ijk} = C_{ijk}$; 

(c) $a_i = \alpha_i$, $b_{ij} = B_{ij}$, $c_{ijk} = C_{ijk}$; 

(d) $a_i = A_i$, $b_{ij} = B_{ij}$, $c_{ijk} = C_{ijk}$; 

where the $\alpha$'s, $\beta$'s, and $\gamma$'s are unknown constants defined uniquely by the usual 
conventions, and the A's, B's, C's, and U's are unobservable random variables, independently 
normally distributed with means zero and with variances $\sigma_A^2$, $\sigma_B^2$, $\sigma_C^2$, and $\sigma^2$. 

In each case, transform the $X_{ijkl}$ to independent variables $Z_{ijkl}$ and obtain the UMVU 
estimators of the unknown parameters. 

5.17 For the situation of Example 5.5, relax the assumption of normality to only assume 
that $A_i$ and $U_{ij}$ have zero means and finite second moments. Show that among all linear 
estimators (of the form $\sum c_{ij}X_{ij}$, $c_{ij}$ known), the UMVU estimator of $\mu + \alpha_i$ (the best 
linear predictor) is given by (5.14). 

[This is a Gauss-Markov theorem for prediction in mixed models. See Harville (1976) 
for generalizations.] 


Section 6 

6.1 In Example 6.1, show that $\gamma_{ij} = 0$ for all i, j is equivalent to $p_{ij} = p_{i+}p_{+j}$. [Hint: 
$\gamma_{ij} = \xi_{ij} - \xi_{i\cdot} - \xi_{\cdot j} + \xi_{\cdot\cdot} = 0$ implies $p_{ij} = a_i b_j$ and hence $p_{i+} = ca_i$ and $p_{+j} = b_j/c$ for 
suitable $a_i$, $b_j$, and c > 0.] 

6.2 In Example 6.2, show that the conditional independence of A, B given C is equivalent 
to $\alpha_{ijk}^{ABC} = \alpha_{ij}^{AB} = 0$ for all i, j, and k. 

6.3 In Example 6.1, show that the conditional distribution of the vectors $(n_{i1}, \ldots, n_{iJ})$ 
given the values of $n_{i+}$ ($i = 1, \ldots, I$) is that of I independent vectors with multinomial 
distribution $M(p_{1|i}, \ldots, p_{J|i}; n_{i+})$ where $p_{j|i} = p_{ij}/p_{i+}$. 

6.4 Show that the distribution of the preceding problem also arises in Example 6.1 when 
the n subjects, rather than being drawn from the population at large, are randomly drawn: 
$n_{1+}$ from Category $A_1$, ..., $n_{I+}$ from Category $A_I$. 

6.5 An application of log linear models in genetics is through the Hardy-Weinberg model 
of mating. If a parent population contains alleles A, a with frequencies p and 1 − p, 
then standard random mating assumptions will result in offspring with genotypes AA, 
Aa, and aa with frequencies $\theta_1 = p^2$, $\theta_2 = 2p(1 - p)$, and $\theta_3 = (1 - p)^2$. 

(a) Give the full multinomial model for this situation, and show how the Hardy-Weinberg 
model is a non-full-rank submodel. 

(b) For a sample $X_1, \ldots, X_n$ of n offspring, find the minimal sufficient statistic. 

[See Brown (1986a) for a more detailed development of this model.] 

6.6 A city has been divided into I major districts and the ith district into $J_i$ subdistricts, 
all of which have populations of roughly equal size. From the police records for a given 
year, a random sample of n robberies is obtained. Write the joint multinomial distribution 
of the numbers $n_{ij}$ of robberies in subdistrict (i, j) for this nested two-way layout as 
$\frac{n!}{\prod n_{ij}!}\exp(\sum\sum n_{ij}\xi_{ij})$ with $\xi_{ij} = \mu + \alpha_i + \beta_{ij}$, where $\sum_i \alpha_i = \sum_j \beta_{ij} = 0$, and show that the 
assumption $\beta_{ij} = 0$ for all i, j is equivalent to the assumption that $p_{ij} = p_{i+}/J_i$ for all i, j. 
6.7 Instead of a sample of fixed size n in the preceding problem, suppose the observations 
consist of all robberies taking place within a given time period, so that n is the value 
taken on by a random variable N. Suppose that N has a Poisson distribution with 
unknown expectation $\lambda$ and that the conditional distribution of the $n_{ij}$ given N = n is the 
distribution assumed for the $n_{ij}$ in the preceding problem. Find the UMVU estimator of 
$\lambda p_{ij}$ and show that no unbiased estimator of $p_{ij}$ exists. [Hint: See the following problem.] 

6.8 Let N be an integer-valued random variable with distribution $P_\theta(N = n) = P_\theta(n)$, 
$n = 0, 1, \ldots$, for which N is complete. Given N = n, let X have the binomial distribution 
b(p, n) for n > 0, with p unknown, and let X = 0 when n = 0. For the observations 
(N, X): 

(a) Show that (N, X) is complete. 

(b) Determine the UMVU estimator of $pE_\theta(N)$. 

(c) Show that no unbiased estimator of any function g(p) exists if $P_\theta(0) > 0$ for some 
$\theta$. 

(d) Determine the UMVU estimator of p if $P_\theta(0) = 0$ for all $\theta$. 


Section 7 

7.1 (a) Consider a population $\{a_1, \ldots, a_N\}$ with the parameter space defined by the 
restriction $a_1 + \cdots + a_N = A$ (known). A simple random sample of size n is drawn 
in order to estimate $\tau^2$. Assuming the labels to have been discarded, show that 
$Y_{(1)}, \ldots, Y_{(n)}$ are not complete. 

(b) Show that Theorem 7.1 need not remain valid when the parameter space is of the 
form $V_1 \times V_2 \times \cdots \times V_N$. [Hint: Let N = 2, n = 1, $V_1 = \{1, 2\}$, $V_2 = \{3, 4\}$.] 

7.2 If $Y_1, \ldots, Y_n$ are the sample values obtained in a simple random sample of size n from 
the finite population (7.2), then (a) $E(Y_i) = \bar{a}$, (b) $\mathrm{var}(Y_i) = \tau^2$, and (c) $\mathrm{cov}(Y_i, Y_j) = 
-\tau^2/(N - 1)$. 
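
For a small population these three identities can be verified exactly by listing every ordered sample. The sketch below is an illustration with hypothetical population values; it assumes that $\bar{a}$ is the population mean and that $\tau^2$ is the population variance with divisor N, and it uses exact rational arithmetic so that the comparisons are equalities rather than approximations.

```python
from fractions import Fraction
from itertools import permutations

population = [Fraction(v) for v in (2, 5, 7, 11, 20)]   # hypothetical values a_1, ..., a_N
N, n = len(population), 3

mean = sum(population) / N
tau2 = sum((a - mean) ** 2 for a in population) / N      # variance with divisor N

# Under simple random sampling every ordered draw without replacement is equally likely.
draws = list(permutations(population, n))
e_y1 = sum(d[0] for d in draws) / len(draws)
var_y1 = sum((d[0] - mean) ** 2 for d in draws) / len(draws)
cov_y1_y2 = sum((d[0] - mean) * (d[1] - mean) for d in draws) / len(draws)

print(e_y1 == mean, var_y1 == tau2, cov_y1_y2 == -tau2 / (N - 1))
```

The negative covariance reflects sampling without replacement: drawing a large value first leaves a remaining population with a smaller mean.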

7.3 Verify equations (a) (7.6), (b) (7.8), and (c) (7.13). 

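7.4 For the situation of Example 7.4: 

(a) Show that $E\bar{Y}_{\nu-1} = E\left[\frac{1}{\nu - 1}\sum_{1}^{\nu-1} Y_i\right] = \bar{a}$. 

(b) Show that $\left[\frac{1}{\nu - 1} - \frac{1}{N}\right]\frac{1}{\nu - 2}\sum_{1}^{\nu-1}(Y_i - \bar{Y}_{\nu-1})^2$ is an unbiased estimator of $\mathrm{var}(\bar{Y}_{\nu-1})$. 

[Pathak (1976) proved (a) by first showing that $EY_1 = \bar{a}$, and then that $E(Y_1 \mid T_0) = \bar{Y}_{\nu-1}$. 
To avoid trivialities, Pathak also assumes that $C_i + C_j < Q$ for all i, j, so that at least 
three observations are taken.] 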

7.5 Random variables $X_1, \ldots, X_n$ are exchangeable if any permutation of $X_1, \ldots, X_n$ 
has the same distribution. 

(a) If $X_1, \ldots, X_n$ are iid, distributed as Bernoulli(p), show that given $\sum X_i = t$, 
$X_1, \ldots, X_n$ are exchangeable (but not independent). 

(b) For the situation of Example 7.4, show that given $T = \{(C_1, X_1), \ldots, (C_\nu, X_\nu)\}$, 
the $\nu - 1$ preterminal observations are exchangeable. 

The idea of exchangeability is due to de Finetti (1974), who proved a theorem that 
characterizes the distribution of exchangeable random variables as mixtures of iid random 
variables. Exchangeable random variables play a large role in Bayesian statistics; see 
Bernardo and Smith 1994 (Sections 4.2 and 4.3). 
7.6 For the situation of Example 7.4, assuming that (a) and (b) hold: 

(a) Show that $\hat{a}$ of (7.9) is UMVUE for $\bar{a}$. 

(b) Defining $S^2 = \sum_{i=1}^{\nu}(Y_i - \bar{Y})^2/(\nu - 1)$, show that 

.2 c2 MS iv]^r l -S 2 

a =S - V=2 - 

is UMVUE for $\tau^2$ of (7.7), where $MS_{[\nu]}$ is the variance of the observations in the 
set (7.10). 

[Kremers (1986) uses conditional expectation arguments (Rao-Blackwellization), 
and completeness, to establish these results. He also assumes that at least $n_0$ 
observations are taken. To avoid trivialities, we can assume $n_0 > 3$.] 


7.7 In simple random sampling, with labels discarded, show that a necessary condition 
for $h(a_1, \ldots, a_N)$ to be U-estimable is that h is symmetric in its N arguments. 

7.8 Prove Theorem 7.7. 

7.9 Show that the approximate variance (7.16) for stratified sampling with $n_i = nN_i/N$ 
(proportional allocation) is never greater than the corresponding approximate variance 
$\tau^2/n$ for simple random sampling with the same total sample size. 

7.10 Let $V_p$ be the exact variance (7.15) and $V_r$ the corresponding variance for simple 
random sampling given by (7.6) with $n = \sum n_i$, $N = \sum N_i$, $n_i/n = N_i/N$ and $\tau^2 = 
\sum\sum(a_{ij} - a_{\cdot\cdot})^2/N$. 

(a) Show that $V_r - V_p = \frac{N - n}{n(N - 1)N}\left[\sum N_i(a_{i\cdot} - a_{\cdot\cdot})^2 - \frac{1}{N}\sum\frac{N - N_i}{N_i - 1}N_i\tau_i^2\right]$. 

(b) Give an example in which $V_r < V_p$. 


7.11 The approximate variance (7.16) for stratified sampling with a total sample size 
$n = n_1 + \cdots + n_s$ is minimized when $n_i$ is proportional to $N_i\tau_i$. 

7.12 For sampling designs where the inclusion probabilities $\pi_i = \sum_{s: i \in s} P(s)$ of including 
the ith sample value $Y_i$ are known, a frequently used estimator of the population total is 
the Horvitz-Thompson (1952) estimator $\delta_{HT} = \sum_i Y_i/\pi_i$. 

(a) Show that $\delta_{HT}$ is an unbiased estimator of the population total. 

(b) The variance of $\delta_{HT}$ is given by 

$$\mathrm{var}(\delta_{HT}) = \sum_i Y_i^2\left[\frac{1}{\pi_i} - 1\right] + \sum_{i \ne j} Y_iY_j\left[\frac{\pi_{ij}}{\pi_i\pi_j} - 1\right]$$

where $\pi_{ij}$ are the second-order inclusion probabilities $\pi_{ij} = \sum_{s: i,j \in s} P(s)$. 

Note that it is necessary to know the labels in order to calculate $\delta_{HT}$, thus Theorem 7.5 
precludes any overall optimality properties. See Hedayat and Sinha 1991 (Chapters 2 
and 3) for a thorough treatment of $\delta_{HT}$. 
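
Both parts can be checked exactly on a toy design. The sketch below is an illustration only: the population values and the unequal-probability design over all pairs of units are hypothetical, and the variance is compared with the standard Horvitz-Thompson expression $\sum_i Y_i^2(1/\pi_i - 1) + \sum_{i \ne j} Y_iY_j(\pi_{ij}/(\pi_i\pi_j) - 1)$, where the sums run over population units. Rational arithmetic makes the comparisons exact.

```python
from fractions import Fraction
from itertools import combinations

values = [Fraction(v) for v in (3, 8, 4, 10)]       # hypothetical population values
units = range(len(values))

# A hypothetical design: every pair of units is a possible sample, with unequal probabilities.
samples = list(combinations(units, 2))
weights = [1, 2, 3, 4, 5, 6]
design = {s: Fraction(w, sum(weights)) for s, w in zip(samples, weights)}

# First- and second-order inclusion probabilities.
pi = [sum(p for s, p in design.items() if i in s) for i in units]
pi2 = {(i, j): sum(p for s, p in design.items() if i in s and j in s)
       for i in units for j in units if i != j}

def ht(sample):
    """Horvitz-Thompson estimate of the population total from one sample."""
    return sum(values[i] / pi[i] for i in sample)

total = sum(values)
mean_ht = sum(p * ht(s) for s, p in design.items())
var_ht = sum(p * (ht(s) - total) ** 2 for s, p in design.items())

var_formula = (sum(values[i] ** 2 * (1 / pi[i] - 1) for i in units)
               + sum(values[i] * values[j] * (pi2[i, j] / (pi[i] * pi[j]) - 1)
                     for i in units for j in units if i != j))

print(mean_ht == total, var_ht == var_formula)
```

The computation of `ht` uses the label of each sampled unit to look up its inclusion probability, which illustrates why the estimator cannot be calculated once labels are discarded.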

7.13 Suppose that an auxiliary variable is available for each element of the population 
(7.2) so that $\theta = \{(1, a_1, b_1), \ldots, (N, a_N, b_N)\}$. If $Y_1, \ldots, Y_n$ and $Z_1, \ldots, Z_n$ denote the 
values of a and b observed in a simple random sample of size n, and $\bar{Y}$ and $\bar{Z}$ denote 
their averages, then 

$$\mathrm{cov}(\bar{Y}, \bar{Z}) = E(\bar{Y} - \bar{a})(\bar{Z} - \bar{b}) = \frac{N - n}{nN(N - 1)}\sum(a_i - \bar{a})(b_i - \bar{b}).$$





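7.14 Under the assumptions of Problem 7.13, if $B = b_1 + \cdots + b_N$ is known, an alternative 
unbiased estimator of $\bar{a}$ is 

$$\left(\frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{Z_i}\right)\bar{b} + \frac{n(N - 1)}{(n - 1)N}\left(\bar{Y} - \frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{Z_i}\,\bar{Z}\right).$$

[Hint: Use the facts that $E(Y_1/Z_1) = (1/N)\sum(a_i/b_i)$ and that by the preceding problem 

$$E\left[\frac{1}{n - 1}\sum\frac{Y_i}{Z_i}(Z_i - \bar{Z})\right] = \frac{1}{N - 1}\sum\frac{a_i}{b_i}(b_i - \bar{b}).]$$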
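
Unbiasedness of a ratio-type estimator of this kind can be confirmed exactly for a small population by averaging over all simple random samples. The sketch below is an illustration with hypothetical population values; it assumes the estimator has the form "mean of the sampled ratios times $\bar{b}$, plus $n(N-1)/[(n-1)N]$ times ($\bar{Y}$ minus the mean ratio times $\bar{Z}$)", and it uses rational arithmetic so the final comparison is an exact equality.

```python
from fractions import Fraction
from itertools import combinations

a = [Fraction(v) for v in (4, 9, 6, 15, 10)]   # hypothetical values a_i
b = [Fraction(v) for v in (2, 3, 3, 5, 4)]     # hypothetical auxiliary values b_i (nonzero)
N, n = len(a), 3
a_bar, b_bar = sum(a) / N, sum(b) / N

def estimate(sample):
    """Mean sampled ratio times b_bar, plus a correction for the bias of that product."""
    y_bar = sum(a[i] for i in sample) / n
    z_bar = sum(b[i] for i in sample) / n
    r_bar = sum(a[i] / b[i] for i in sample) / n
    return r_bar * b_bar + Fraction(n * (N - 1), (n - 1) * N) * (y_bar - r_bar * z_bar)

# All subsets of size n are equally likely under simple random sampling.
samples = list(combinations(range(N), n))
expectation = sum(estimate(s) for s in samples) / len(samples)
print(expectation == a_bar)
```

The correction term is a multiple of the sample covariance between the ratios $Y_i/Z_i$ and the $Z_i$, which is where the hint's second fact enters.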


7.15 In connection with cluster sampling, consider a set W of vectors $(a_1, \ldots, a_M)$ 
and the totality G of transformations taking $(a_1, \ldots, a_M)$ into $(a_1', \ldots, a_M')$ such that 
$(a_1', \ldots, a_M') \in W$ and $\sum a_i' = \sum a_i$. Give examples of W such that for any real number 
$a_1$ there exist $a_2, \ldots, a_M$ with $(a_1, \ldots, a_M) \in W$ and such that 

(a) G consists of the identity transformation only; 

(b) G consists of the identity and one other element; 

(c) G is transitive over W. 

7.16 For cluster sampling with unequal cluster sizes $M_i$, Problem 7.14 provides an 
alternative estimator of $\bar{a}$, with $M_i$ in place of $b_i$. Show that this estimator reduces to $\bar{Y}$ if 
$b_1 = \cdots = b_N$ and hence when the $M_i$ are equal. 

7.17 Show that (7.17) holds if and only if $\delta$ depends only on X″, defined by (7.18). 


9 Notes 

9.1 History 

The theory of equivariant estimation of location and scale parameters is due to Pitman 
(1939), and the first general discussions of equivariant estimation were provided by 
Peisakoff (1950) and Kiefer (1957). The concept of risk-unbiasedness (but not the term) 
and its relationship to equivariance were given in Lehmann (1951). 

The linear models of Section 3.4 and Theorem 4.12 are due to Gauss. The history of 
both is discussed in Seal (1967); see also Stigler 1981. The generalization to exponential 
linear models was introduced by Dempster (1971) and Nelder and Wedderburn (1972). 

The notions of Functional Equivariance and Formal Invariance, discussed in Section 
3.2, have been discussed by other authors, sometimes using different names. Functional 
Equivariance is called the Principle of Rational Invariance by Berger (1985, Section 
6.1), Measurement Invariance by Casella and Berger (1990, Section 7.2.4), and 
Parameter Invariance by Dawid (1983). Schervish (1995, Section 6.2.2) argues that this 
principle is really only a reparametrization of the problem, and has nothing to do with 
invariance. This is almost in agreement with the principle of functional equivariance; 
however, it is still the case that when reparameterizing one must be careful to properly 
reparameterize the estimator, density, and loss function, which is part of the prescription 
of an invariant problem. This type of invariance is commonly illustrated by the example 
that if $\delta$ measures temperature in degrees Celsius, then $(9/5)\delta + 32$ should be used to 
measure temperature in degrees Fahrenheit (see Problems 2.9 and 2.10). 

What we have called Formal Invariance was also called by that name in Casella and 
Berger (1990), but was called the Invariance Principle by Berger (1985) and Context 
Invariance by Dawid (1983). 
9.2 Subgroups 

The idea of improving an MRE estimator by imposing equivariance only under a subgroup
was used by Stein (1964), Brown (1968), and Brewster and Zidek (1974) to find
improved estimators of a normal variance. Stein’s 1964 proof is also discussed in detail
by Maatta and Casella (1990), who give a history of decision-theoretic variance estimation.
The proof of Stein (1964) contains key ideas that were further developed by Brown
(1968), and led to Brewster and Zidek (1974) finding the best equivariant estimator of
the form (2.33). [See Problem 2.14.]

9.3 General Linear Group 

The general linear group (also called the full linear group) is an example of a group that
can be thought of as a multivariate extension of the location-scale group. Let $X_1, \ldots, X_n$
be iid according to a p-variate normal distribution $N_p(\mu, \Sigma)$, and define $\mathbf{X}$ as the p x n
matrix $(X_1, \ldots, X_n)$ and $\bar{X}$ as the n x 1 vector $(\bar{X}_1, \ldots, \bar{X}_n)$. Consider the group of
transformations

         $\mathbf{X}' = A\mathbf{X} + b$,
(9.1)
         $\mu' = A\mu + b$,   $\Sigma' = A \Sigma A'$,

where A is a p x p nonsingular matrix and b is a p x 1 vector. [The group of real
p x p nonsingular matrices, with matrix multiplication as the group operation, is called
the general linear group, denoted $Gl_p$ (see Eaton 1989 for a further development). The
group (9.1) adds a location component.]

Consider now the estimation of $\Sigma$. (The estimation of $\mu$ is left to Problem 3.30.) An
invariant loss function, analogous to squared error loss, is of the form

(9.2)    $L(\Sigma, \delta) = \mathrm{tr}[\Sigma^{-1}(\delta - \Sigma)\Sigma^{-1}(\delta - \Sigma)] = \mathrm{tr}[\Sigma^{-1/2} \delta \Sigma^{-1/2} - I]^2$,

where tr[·] is the trace of a matrix (see Eaton 1989, Example 6.2, or Olkin and Selliah
1977). It can be shown that equivariant estimators are of the form cS, where
$S = (\mathbf{X} - 1\bar{X}')(\mathbf{X} - 1\bar{X}')'$, with 1 a p x 1 vector of 1’s and c a constant, is the cross-products matrix
(Problem 3.31). Since the group is transitive, the MRE estimator is given by the value
of c that minimizes

(9.3)    $E_I L(I, cS) = E_I \mathrm{tr}(cS - I)'(cS - I)$,

that is, the risk with $\Sigma = I$. Since

         $E_I \mathrm{tr}(cS - I)'(cS - I) = c^2 E_I \mathrm{tr} S^2 - 2c E_I \mathrm{tr} S + p$,

the minimizing c is given by $c = E_I \mathrm{tr} S / E_I \mathrm{tr} S^2$. Note that, for p = 1, this reduces to the
best equivariant estimator of quadratic loss in the scalar case. Other equivariant losses,
such as Stein’s loss (3.20), can be handled in a similar manner. See Problems 3.29-3.33
for details.
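
As a rough numerical check of the minimizing constant, the following sketch estimates $c = E_I \mathrm{tr} S / E_I \mathrm{tr} S^2$ by simulation. It is an illustration under stated assumptions and not part of the original development: the function name `mre_constant_mc`, the use of NumPy, and the simulation sizes are hypothetical choices, and S is taken here to be the cross-products matrix of the observations centered at their p-dimensional sample mean. Under that assumption S is Wishart with n - 1 degrees of freedom when $\Sigma = I$, and standard Wishart moment formulas suggest c = 1/(n + p), which becomes 1/(n + 1) for p = 1.

```python
import numpy as np

def mre_constant_mc(p, n, reps=20000, seed=0):
    """Monte Carlo estimate of c = E tr(S) / E tr(S^2) when Sigma = I.

    Assumption (not from the text): S is formed from columns centered
    at the p x 1 sample mean vector.
    """
    rng = np.random.default_rng(seed)
    # reps independent p x n data matrices with iid N(0, 1) entries
    X = rng.standard_normal((reps, p, n))
    centered = X - X.mean(axis=2, keepdims=True)
    # cross-products matrices, shape (reps, p, p)
    S = centered @ np.swapaxes(centered, 1, 2)
    tr_S = np.trace(S, axis1=1, axis2=2)
    tr_S2 = np.trace(S @ S, axis1=1, axis2=2)
    return tr_S.mean() / tr_S2.mean()

if __name__ == "__main__":
    for p, n in [(1, 10), (3, 10)]:
        print(p, n, mre_constant_mc(p, n), 1.0 / (n + p))
```

The simulated ratio should agree with 1/(n + p) only up to Monte Carlo error, so the comparison is meant as a sanity check rather than a proof.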

9.4 Finite Populations 

Estimation in finite populations has, until recently, been developed largely outside the 
mainstream of statistics. The books by Cassel, Särndal, and Wretman (1977) and Särndal,
Swenson, and Wretman (1992) constitute important efforts at a systematic presentation 
of this topic within the framework of theoretical statistics. The first steps in this direction 
were taken by Neyman (1934) and by Blackwell and Girshick (1954). The need to 
consider the labels as part of the data was first emphasized by Godambe (1955). Theorem 
7.1 is due to Watson (1964) and Royall (1968), and Theorem 7.5 to Basu (1971). 



CHAPTER 4 


Average Risk Optimality 


1 Introduction 

So far, we have been concerned with finding estimators which minimize the risk
$R(\theta, \delta)$ at every value of $\theta$. This was possible only by restricting the class of
estimators to be considered by an impartiality requirement such as unbiasedness or
equivariance. We shall now drop such restrictions, admitting all estimators into
competition, but shall then have to be satisfied with a weaker optimality property
than uniformly minimum risk. We shall look for estimators that make the
risk function $R(\theta, \delta)$ small in some overall sense. Two such optimality properties
will be considered: minimizing the (weighted) average risk for some suitable
non-negative weight function and minimizing the maximum risk. The second (minimax)
approach will be taken up in Chapter 5; the present chapter is concerned
with the first of these approaches, the problem of minimizing

(1.1)    $r(\Lambda, \delta) = \int R(\theta, \delta) \, d\Lambda(\theta)$

where we shall assume that the weights represented by $\Lambda$ add up to 1, that is,

(1.2)    $\int d\Lambda(\theta) = 1$,

so that $\Lambda$ is a probability distribution. An estimator $\delta$ minimizing (1.1) is called a
Bayes estimator with respect to $\Lambda$.

The problem of determining such Bayes estimators arises in a number of different
contexts.

(i) As Mathematical Tools

Bayes estimators play a central role in Wald’s decision theory. It is one of the
main results of this theory that in any given statistical problem, attention can be
restricted to Bayes solutions and suitable limits of Bayes solutions; given any other
procedure $\delta$, there exists a procedure $\delta'$ in this class such that $R(\theta, \delta') \le R(\theta, \delta)$
for all values of $\theta$. (In view of this result, it is not surprising that Bayes estimators
provide a tool for solving minimax problems, as will be seen in the next chapter.)

(ii) As a Way of Utilizing Past Experience 

It is frequently reasonable to treat the parameter $\theta$ of a statistical problem as
the realization of a random variable $\Theta$ with known distribution rather than as an
unknown constant. Suppose, for example, that we wish to estimate the probability
of a penny showing heads when spun on a flat surface. So far, we would have 
considered n spins of the penny as a set of n binomial trials with an unknown 
probability p of showing heads. Suppose, however, that we have had considerable 
experience with spinning pennies, experience perhaps which has provided us with 
approximate values of p for a large number of similar pennies. If we believe this 
experience to be relevant to the present penny, it might be reasonable to represent 
this past knowledge as a probability distribution for p, the approximate shape of 
which is suggested by the earlier data. 

This is not as unlike the modeling we have done in the earlier sections as it 
may seem at first sight. When assuming that the random variables representing the 
outcomes of our experiments have normal, Poisson, exponential distributions, and 
so on, we also draw on past experience. Furthermore, we also realize that these 
models are in no sense exact but, at best, represent reasonable approximations. 
There is the difference that in earlier models we have assumed only the shape of 
the distribution to be known but not the values of the parameters, whereas now we 
extend our model to include a specification of the prior distribution. However, this 
is a difference in degree rather than in kind and may be quite reasonable if the past 
experience is sufficiently extensive. 

A difficulty, of course, is the assumption that past experience is relevant to the 
present case. Perhaps the mint has recently changed its manufacturing process, 
and the present coin, although it looks like the earlier ones, has totally different 
spinning properties. Similar kinds of judgment are required also for the models 
considered earlier. In addition, the conclusions derived from statistical procedures 
are typically applied not only to the present situation or population but also to 
those in the future, and extrastatistical judgment is again required in deciding how 
far such extrapolation is justified. 

The choice of the prior distribution $\Lambda$ is typically made like that of the
distributions $P_\theta$ by combining experience with convenience. When we make the
assumption that the amount of rainfall has a gamma distribution, we probably do
not do so because we really believe this to be the case but because the gamma
family is a two-parameter family which seems to fit such data reasonably well
and which is mathematically very convenient. Analogously, we can obtain a prior
distribution by starting with a flexible family that is mathematically easy to handle
and selecting a member from this family which approximates our past experience.
Such an approach, in which the model incorporates a prior distribution for $\theta$ to
reflect past experience, is useful in fields in which a large amount of past experience
is available. It can be brought to bear, for example, in many applications in
agriculture, education, business, and medicine.

There are important differences between the modeling of the distributions $P_\theta$
and that of $\Lambda$. First, we typically have a number of observations from $P_\theta$ and can
use these to check the assumption of the form of the distribution. Such a check of $\Lambda$
is not possible on the basis of one experiment because the value of $\theta$ under study
represents only a single observation from this distribution. A second difference
concerns the meaning of a replication of the experiment. In the models preceding
this section, the replication would consist of drawing another set of observations
from $P_\theta$ with the same value of $\theta$. In the model of the present section, we would
replicate the experiment by first drawing another value, $\theta'$, of $\Theta$ from $\Lambda$ and then a
set of observations from $P_{\theta'}$. It might be argued that sampling of the $\theta$ values (choice
of penny, for example) may be even more haphazard and less well controlled than
the choice of subjects for an experiment or a study, which assumes these subjects
to be a random sample from the population of interest. However, it could also be
argued that the assumption of a fixed value of $\theta$ is often unrealistic. As we will
see, the Bayesian approaches of robust and hierarchical analysis attempt to address
these problems.

(iii) As a Description of a State of Mind 

A formally similar approach is adopted by the so-called Bayesian school, which
interprets $\Lambda$ as expressing the subjective feeling about the likelihood of different
$\theta$ values. In the presence of a large amount of previous experience, the chosen
$\Lambda$ would often be close to that made under (ii), but the subjective approach can
be applied even when little or no prior knowledge is available. In the latter case,
for example, the prior distribution $\Lambda$ then models the state of ignorance about $\theta$.
The subjective Bayesian uses the observations X to modify prior beliefs. After
X = x has been observed, the belief about $\theta$ is expressed by the posterior (i.e.,
conditional) distribution of $\Theta$ given x.

Detailed discussions of this approach, which we shall not pursue here, can be 
found, for example, in books by Savage (1954), Lindley (1965), de Finetti (1970, 
1974), Box and Tiao (1973), Novick and Jackson (1974), Berger (1985), Bernardo 
and Smith (1994), Robert (1994a) and Gelman et al. (1995). 

A note on notation: In Bayesian (as in frequentist) arguments, it is important to
keep track of which variables are being conditioned on. Thus, the density of X will
be denoted by $X \sim f(x|\theta)$. Prior distributions will typically be denoted by $\Pi$ or
$\Lambda$ with their density functions being $\pi(\theta|\lambda)$ or $\gamma(\lambda)$, where $\lambda$ is another parameter
(sometimes called a hyperparameter). From these distributions we often calculate
conditional distributions such as that of $\theta$ given x and $\lambda$, or $\lambda$ given x (called posterior
distributions). These typically have densities, denoted by $\pi(\theta|x, \lambda)$ or $\gamma(\lambda|x)$.
We will also be interested in marginal distributions such as $m(x|\lambda)$. To illustrate,
$\pi(\theta|x, \lambda) = f(x|\theta)\pi(\theta|\lambda)/m(x|\lambda)$, where $m(x|\lambda) = \int f(x|\theta)\pi(\theta|\lambda) \, d\theta$.

It is convenient to use boldface to denote vectors, for example, $\mathbf{x} = (x_1, \ldots, x_n)$,
so we can write $f(\mathbf{x}|\theta)$ for the sample density $f(x_1, \ldots, x_n|\theta)$.

The determination of a Bayes estimator is, in principle, quite simple. First,
consider the situation before any observations are taken. Then, $\Theta$ has distribution
$\Lambda$ and the Bayes estimator of $g(\Theta)$ is any number d minimizing $E L(\Theta, d)$. Once
the data have been obtained and are given by the observed value x of X, the prior
distribution $\Lambda$ of $\Theta$ is replaced by the posterior, that is, conditional, distribution of
$\Theta$ given x and the Bayes estimator is any number $\delta(x)$ minimizing the posterior
risk $E\{L[\Theta, \delta(x)] \,|\, x\}$. The following is a precise statement of this result, where,
as usual, measurability considerations are ignored.

Theorem 1.1 Let $\Theta$ have distribution $\Lambda$, and given $\Theta = \theta$, let X have distribution
$P_\theta$. Suppose, in addition, the following assumptions hold for the problem of
estimating $g(\Theta)$ with non-negative loss function $L(\theta, d)$.

(a) There exists an estimator $\delta_0$ with finite risk.

(b) For almost all x, there exists a value $\delta_\Lambda(x)$ minimizing

(1.3)    $E\{L[\Theta, \delta(x)] \,|\, X = x\}$.

Then, $\delta_\Lambda(X)$ is a Bayes estimator.

Proof. Let $\delta$ be any estimator with finite risk. Then, (1.3) is finite a.e. since L is
non-negative. Hence,

         $E\{L[\Theta, \delta(x)] \,|\, X = x\} \ge E\{L[\Theta, \delta_\Lambda(x)] \,|\, X = x\}$ a.e.,

and the result follows by taking the expectation of both sides. □

[For a discussion of some measurability aspects and more detail when $L(\theta, d) =
\rho(d - \theta)$, see DeGroot and Rao 1963. Brown and Purves (1973) provide a general
treatment.]

Corollary 1.2 Suppose the assumptions of Theorem 1.1 hold.

(a) If $L(\theta, d) = [d - g(\theta)]^2$, then

(1.4)    $\delta_\Lambda(x) = E[g(\Theta)|x]$

and, more generally, if

(1.5)    $L(\theta, d) = w(\theta)[d - g(\theta)]^2$,

then

(1.6)    $\delta_\Lambda(x) = \frac{\int w(\theta) g(\theta) \, d\Lambda(\theta|x)}{\int w(\theta) \, d\Lambda(\theta|x)} = \frac{E[w(\Theta) g(\Theta)|x]}{E[w(\Theta)|x]}$.

(b) If $L(\theta, d) = |d - g(\theta)|$, then $\delta_\Lambda(x)$ is any median of the conditional distribution
of $g(\Theta)$ given x.

(c) If

(1.7)    $L(\theta, d) = 0$ when $|d - \theta| \le c$,   $L(\theta, d) = 1$ when $|d - \theta| > c$,

then $\delta_\Lambda(x)$ is the midpoint of the interval I of length 2c which maximizes
$P[\Theta \in I \,|\, x]$.

Proof. To prove part (a), note that by Theorem 1.1, the Bayes estimator is obtained
by minimizing

(1.8)    $E\{[g(\Theta) - \delta(x)]^2 \,|\, x\}$.

By assumption (a) of Theorem 1.1, there exists $\delta_0(x)$ for which (1.8) is finite
for almost all values of x, and it then follows from Example 1.7.17 that (1.8) is
minimized by (1.4).

The proofs of the other parts are completely analogous. □

Example 1.3 Poisson. The parameter $\theta$ of a Poisson($\theta$) distribution is both the
mean and the variance of the distribution. Although squared error loss $L_0(\theta, \delta) =
(\theta - \delta)^2$ is often preferred for the estimation of a mean, some type of scaled squared
error loss, for example, $L_k(\theta, \delta) = (\theta - \delta)^2/\theta^k$, may be more appropriate for the
estimation of a variance.

If $X_1, \ldots, X_n$ are iid Poisson($\theta$), and $\theta$ has the gamma(a, b) prior distribution,
then the posterior distribution is

         $\pi(\theta|x) = \mathrm{Gamma}\left(a + x, \frac{b}{1 + b}\right)$

and the Bayes estimator under $L_k$ is given by (see Problem 1.1)

         $\delta^k(x) = \frac{E(\theta^{1-k}|x)}{E(\theta^{-k}|x)} = \frac{b}{1 + b}(x + a - k)$

for a - k > 0. Thus, the choice of loss function can have a large effect on the
resulting Bayes estimator.
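
The ratio of posterior moments in Example 1.3 is an instance of (1.6) with weight $w(\theta) = \theta^{-k}$, and it can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not stated in the text: it treats x as a single Poisson observation, it reads gamma(a, b) as having shape a and scale b (density proportional to $\theta^{a-1} e^{-\theta/b}$), and the function names `gamma_moment`, `poisson_bayes_estimate`, and `closed_form` are hypothetical.

```python
from math import exp, lgamma, log

def gamma_moment(r, shape, scale):
    """E[theta**r] for theta ~ Gamma(shape, scale); needs shape + r > 0."""
    if shape + r <= 0:
        raise ValueError("moment does not exist for these arguments")
    return exp(r * log(scale) + lgamma(shape + r) - lgamma(shape))

def poisson_bayes_estimate(x, a, b, k):
    """Bayes estimate under L_k via the moment ratio E[theta^(1-k)|x] / E[theta^(-k)|x]."""
    shape = a + x              # posterior shape (single observation assumed)
    scale = b / (1.0 + b)      # posterior scale
    return gamma_moment(1 - k, shape, scale) / gamma_moment(-k, shape, scale)

def closed_form(x, a, b, k):
    """Closed form quoted in the example: b/(1+b) * (x + a - k)."""
    return b / (1.0 + b) * (x + a - k)

if __name__ == "__main__":
    for k in (0, 1, 2):
        print(k, poisson_bayes_estimate(4, 3.0, 2.0, k), closed_form(4, 3.0, 2.0, k))
```

Increasing k moves the estimate downward by b/(1 + b) per unit, which makes the dependence on the loss function concrete.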


It is frequently important to know whether a Bayes solution is unique. The 
following are sufficient conditions for this to be the case. 

Corollary 1.4 If the loss function $L(\theta, d)$ is squared error, or more generally, if it
is strictly convex in d, a Bayes solution $\delta_\Lambda$ is unique (a.e. $\mathcal{P}$), where $\mathcal{P}$ is the class
of distributions $P_\theta$, provided

(a) the average risk of $\delta_\Lambda$ with respect to $\Lambda$ is finite, and

(b) if Q is the marginal distribution of X given by

         $Q(A) = \int P_\theta(X \in A) \, d\Lambda(\theta)$,

then a.e. Q implies a.e. $\mathcal{P}$.

Proof. For squared error, it follows from Corollary 1.2 that any Bayes estimator
$\delta_\Lambda(x)$ with finite risk must satisfy (1.4) except on a set N of x values with Q(N) = 0.
For general strictly convex loss functions, the result follows by the same argument
from Problem 1.7.26. □

As an example of a case in which condition (b) does not hold, let X have the
binomial distribution b(p, n), $0 \le p \le 1$, and suppose that $\Lambda$ assigns probability
1/2 to each of the values p = 0 and p = 1. Then, any estimator $\delta(X)$ of p with
$\delta(0) = 0$ and $\delta(n) = 1$ is Bayes.

On the other hand, condition (b) is satisfied when the parameter space is an
open set which is the support of $\Lambda$ and if the probability $P_\theta(X \in A)$ is continuous
in $\theta$ for any A. To see this, note that Q(N) = 0 implies $P_\theta(N) = 0$ (a.e. $\Lambda$) by
(1.2.23). If there exists $\theta_0$ with $P_{\theta_0}(N) > 0$, there exists a neighborhood $\omega$ of $\theta_0$
in which $P_\theta(N) > 0$. By the support assumption, $P_\Lambda(\omega) > 0$ and this contradicts
the assumption that $P_\theta(N) = 0$ (a.e. $\Lambda$).

Three different aspects of the performance of a Bayes estimator, or of any other
estimator $\delta$, may be of interest in the present model. These are (a) the Bayes risk
(1.1); (b) the risk function $R(\theta, \delta)$ of Section 1.1 [Equation (1.1.10)] [this is the
frequentist risk, which is now the conditional risk of $\delta(X)$ given $\theta$]; and (c) the
posterior risk given x which is defined by (1.3).

For the determination of the Bayes estimator the relevant criterion is, of course,
(a). However, consideration of (b), the conditional risk given $\theta$, as a function of
$\theta$ provides an important safeguard against an inappropriate choice of $\Lambda$ (Berger
1985, Section 4.7.5). Finally, consideration of (c) is of interest primarily to the
Bayesian. From the Bayesian point of view, the posterior distribution of $\Theta$ given
x summarizes the investigator’s belief about $\theta$ in the light of the observation, and
hence the posterior risk is the only measure of risk of accuracy that is of interest.

The possibility of evaluating the risk function (b) of $\delta_\Lambda$ suggests still another
use of Bayes estimators.

(iv) As a General Method for Generating Reasonable Estimators 

Postulating some plausible distributions $\Lambda$ provides a method for generating interesting
estimators which can then be studied in the conventional way. A difficulty
with this approach is, of course, the choice of $\Lambda$. Methodologies have been developed
to deal with this difficulty which sometimes incorporate frequentist measures
to assess the choice of $\Lambda$. These methods tend to first select not a single
prior distribution but a family of priors, often indexed by a parameter (a so-called
hyperparameter). The family should be chosen so as to balance appropriateness,
flexibility, and mathematical convenience. From it, a plausible member is selected
to obtain an estimator for consideration. The following are some examples of these
approaches, which will be discussed in Sections 4.4 and 4.5.

• Empirical Bayes. The parameters of the prior distribution are themselves estimated
from the data.

• Hierarchical Bayes. The parameters of the prior distribution are, in turn, modeled
by another distribution, sometimes called a hyperprior distribution.

• Robust Bayes. The performance of an estimator is evaluated for each member
of the prior class, with the goal of finding an estimator that performs well (is
robust) for the entire class.

Another possibility leading to a particular choice of $\Lambda$ corresponds to the third
interpretation (iii), in which the state of mind can be described as “ignorance.”
One would then select for $\Lambda$ a noninformative prior which tries (in the spirit of
invariance) to treat all parameter values equitably. Such an approach was developed
by Jeffreys (1939, 1948, 1961), who, on the basis of invariance considerations,
suggests as noninformative prior for $\theta$ a density that is proportional to $\sqrt{|I(\theta)|}$,
where $|I(\theta)|$ is the determinant of the information matrix. A good account of this
approach with many applications is given by Berger (1985), Robert (1994a), and
Bernardo and Smith (1994). Note 9.6 has a further discussion.

Example 1.5 Binomial. Suppose that X has the binomial distribution b(p, n). A
two-parameter family of prior distributions for p which is flexible and for which
the calculation of the conditional distribution is particularly simple is the family
of beta distributions B(a, b). These densities can take on a variety of shapes (see
Problem 1.2) and we note for later reference that the expectation and variance of
a random variable p with density B(a, b) are (Problem 1.5.19)

(1.9)    $E(p) = \frac{a}{a + b}$   and   $\mathrm{var}(p) = \frac{ab}{(a + b)^2 (a + b + 1)}$.


To determine the Bayes estimator of a given estimand g(p), let us first obtain
the conditional distribution (posterior distribution) of p given x. The joint density
of X and p is

         $\binom{n}{x} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \, p^{x+a-1}(1 - p)^{n-x+b-1}$.

The conditional density of p given x is obtained by dividing by the marginal of x,
which is a function of x alone (Problem 2.1). Thus, the conditional density of p
given x has the form

(1.10)    $C(a, b, x) \, p^{x+a-1}(1 - p)^{n-x+b-1}$.

Again, this is recognized to be a beta distribution, with parameters

(1.11)    $a' = a + x$,   $b' = b + n - x$.


Let us now determine the Bayes estimator of g(p) = p when the loss function
is squared error. By (1.4), this is

(1.12)    $\delta_\Lambda(x) = E(p|x) = \frac{a'}{a' + b'} = \frac{a + x}{a + b + n}$.

It is interesting to compare this Bayes estimator with the usual estimator X/n.
Before any observations are taken, the estimator from the Bayesian approach is
the expectation of the prior: a/(a + b). Once X has been observed, the standard
non-Bayesian (for example, UMVU) estimator is X/n. The estimator $\delta_\Lambda(X) =
(a + X)/(a + b + n)$ lies between these two. In fact,

(1.13)    $\frac{a + X}{a + b + n} = \left(\frac{a + b}{a + b + n}\right) \frac{a}{a + b} + \left(\frac{n}{a + b + n}\right) \frac{X}{n}$

is a weighted average of a/(a + b), the estimator of p before any observations are
taken, and X/n, the estimator without consideration of a prior.

The estimator (1.13) can be considered as a modification of the standard estimator
X/n in the light of the prior information about p expressed by (1.9) or as
a modification of the prior estimator a/(a + b) in the light of the observation X.
From this point of view, it is interesting to notice what happens as a and $b \to \infty$,
with the ratio b/a being kept fixed. Then, the estimator (1.12) tends in probability
to a/(a + b), that is, the prior information is so overwhelming that it essentially
determines the estimator. The explanation is, of course, that in this case the beta
distribution B(a, b) concentrates all its mass essentially at a/(a + b) [the variance
in (1.9) tends toward 0], so that the value of p is taken to be essentially known and
is not influenced by X. (“Don’t confuse me with the facts!”)

On the other hand, if a and b are fixed, but $n \to \infty$, it is seen from (1.12) that $\delta_\Lambda$
essentially coincides with X/n. This is the case in which the information provided
by X overwhelms the initial information contained in the prior distribution.

The UMVU estimator X/n corresponds to the case a = b = 0. However, B(0, 0)
is no longer a probability distribution since $\int_0^1 \frac{1}{p(1 - p)} \, dp = \infty$. Even with
such an improper distribution (that is, a distribution with infinite mass), it is possible
formally to calculate a posterior distribution given x. This possibility will be
considered in Example 2.8.
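
The updating rule (1.11), the estimator (1.12), and the weighted-average form (1.13) can be sketched in a few lines. This is an illustration only; the function names `beta_posterior`, `bayes_estimate`, and `weighted_average` and the numerical inputs are hypothetical and not from the text.

```python
def beta_posterior(a, b, x, n):
    """Posterior beta parameters (a', b') from (1.11)."""
    return a + x, b + n - x

def bayes_estimate(a, b, x, n):
    """Posterior mean (1.12): a' / (a' + b')."""
    a_post, b_post = beta_posterior(a, b, x, n)
    return a_post / (a_post + b_post)

def weighted_average(a, b, x, n):
    """Form (1.13): mixes the prior mean a/(a+b) with the sample proportion x/n."""
    w_prior = (a + b) / (a + b + n)
    return w_prior * a / (a + b) + (1.0 - w_prior) * x / n

if __name__ == "__main__":
    a, b, x, n = 2.0, 3.0, 7, 10
    print(bayes_estimate(a, b, x, n), weighted_average(a, b, x, n))
    # A large a + b pulls the estimate toward a/(a+b); a large n pulls it toward x/n.
    print(bayes_estimate(2000.0, 3000.0, 7, 10), bayes_estimate(2.0, 3.0, 7000, 10000))
```

With a = 2, b = 3, x = 7, n = 10 both forms give 9/15, which lies between the prior mean 0.4 and the sample proportion 0.7.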

This may be a good time to discuss a question facing the reader of this book. 
Throughout, the theory is illustrated with examples which are either completely 
formal (that is, without any context) or stated in terms of some vaguely described 
situation in which such an example might arise. In either case, what is assumed is a 
model and, in the present section, a prior distribution. Where do these assumptions 
come from, and how should they be interpreted? “Let X have a binomial distribution
b(p, n) and let p be distributed according to a beta distribution B(a, b).”
Why binomial and why beta?

The assumptions underlying the binomial distribution are (i) independence of the 
n trials and (ii) constancy of the success probability p throughout the series. While 
in practice it is rare for either of these two assumptions to hold exactly - consecutive 
trials typically exhibit some dependence and success probabilities tend to change 
over time (as in Example 1.8.5) - they are often reasonable approximations and
may serve as idealizations in a wide variety of situations arising in the real world.
Similarly, to a reasonable degree, approximate normality may often be satisfied 
according to some version of the central limit theorem, or from past experience. 

Let us next turn to the assumption of a beta prior for p. This leads to an estimator 
which, due to its simplicity, is highly prized for a variety of reasons. But simplicity 
of the solution is of little use if the problem is based on assumptions which bear 
no resemblance to reality. 

Subjective Bayesians, even though perhaps unable to state their prior precisely, 
will typically have an idea of its shape: It may be bimodal, unimodal (symmetric or 
skewed), or it may be L- or U-shaped. In the first of these cases, a beta prior would 
be inappropriate since no beta distribution has more than one mode. However, by 
proper choice of the parameters a and b, a beta distribution can accommodate itself 
to each of the other possibilities mentioned (Problem 1.2), and thus can represent 
a considerable variety of prior shapes. 

The modeling of subjective priors discussed in the preceding paragraph corresponds
to the third of the four interpretations of the Bayes formalism mentioned
at the beginning of the section. A very different approach is suggested by the
fourth interpretation, where formal priors are used simply as a method of generating
a reasonable estimator. A standard choice in this case is to treat all parameter
values equally (which corresponds to a subjective prior modeling ignorance). In
the nineteenth century, the preferred choice for this purpose in the binomial case
was the uniform distribution for p over (0, 1), which is the beta distribution with
a = b = 1. As an alternative, the Jeffreys prior corresponding to a = b = 1/2 (see
the discussion preceding Example 1.5) has the advantage of being invariant under
change of parameters (Schervish 1995, Section 2.3.4). The prior density in this
case is proportional to $[p(1 - p)]^{-1/2}$, which is U-shaped. It is difficult to imagine
many real situations in which an investigator believes that it is equally likely for
the unknown p to be close to either 0 or 1. In this case, the fourth interpretation
would therefore lead to very different priors from those of the third interpretation.


2 First Examples 

In constructing Bayes estimators, as functions of the posterior density, some
choices are made (such as the choice of prior and loss function). These choices will
ultimately affect the properties of the estimators, including not only risk performance
(such as bias and admissibility) but also more fundamental considerations
(such as sufficiency). In this section, we look at a number of examples to illustrate
these points.

Example 2.1 Sequential binomial sampling. Consider a sequence of binomial
trials with a stopping rule as in Section 3.3. Let X, Y, and N denote, respectively,
the number of successes, the number of failures, and the total number of trials at
the moment sampling stops. The probability of any sample path is then $p^x(1 - p)^y$
and we shall again suppose that p has the prior distribution B(a, b). What now is
the posterior distribution of p given X and Y (or equivalently X and N = X + Y)?
The calculation in Example 1.5 shows that, as in the fixed sample size case, it is the
beta distribution with parameters a' and b' given by (1.11), so that, in particular,
the Bayes estimator of p is given by (1.12) regardless of the stopping rule.


Of course, there are stopping rules which even affect Bayesian inference (for
example, “stop when the posterior probability of an event is greater than .9”).
However, if the stopping rule is a function only of the data, then the Bayes inference 
will be independent of it. These so-called proper stopping rules, and other aspects 
of inference under stopping rules, are discussed in detail by Berger and Wolpert 
(1988, Section 4.2). See also Problem 2.2 for another illustration. 

Thus, Example 2.1 illustrates a quite general feature of Bayesian inference: 
The posterior distribution does not depend on the sampling rule but only on the 
likelihood of the observed results. 
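
The independence of the posterior from the sampling rule can be illustrated numerically by comparing fixed-sample-size binomial sampling with inverse sampling (stop at the x-th success). The sketch below is a hedged illustration rather than part of the text: the function name `grid_posterior`, the midpoint grid, and the particular values of x, n, a, and b are assumptions chosen for the example. The two likelihoods differ only by a constant in p, which cancels when the posterior is normalized.

```python
from math import comb

import numpy as np

def grid_posterior(x, n, a, b, sampling, m=2000):
    """Normalized posterior weights for p on a midpoint grid of (0, 1).

    sampling = "binomial": n trials fixed in advance, x successes observed.
    sampling = "inverse":  sample until the x-th success, which occurs on trial n.
    """
    p = (np.arange(m) + 0.5) / m
    prior = p ** (a - 1) * (1 - p) ** (b - 1)      # unnormalized B(a, b) density
    if sampling == "binomial":
        const = comb(n, x)
    elif sampling == "inverse":
        const = comb(n - 1, x - 1)
    else:
        raise ValueError("unknown sampling rule")
    unnorm = const * p ** x * (1 - p) ** (n - x) * prior
    return p, unnorm / unnorm.sum()

if __name__ == "__main__":
    p, w_fixed = grid_posterior(3, 12, 2.0, 3.0, "binomial")
    _, w_inverse = grid_posterior(3, 12, 2.0, 3.0, "inverse")
    print(np.allclose(w_fixed, w_inverse))
    print(np.sum(p * w_fixed), (2.0 + 3) / (2.0 + 3.0 + 12))  # grid mean vs (1.12)
```

The grid is only an approximation to the continuous posterior, so the posterior mean matches (a + x)/(a + b + n) up to discretization error.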


Example 2.2 Normal mean. Let $X_1, \ldots, X_n$ be iid as $N(\theta, \sigma^2)$, with $\sigma$ known,
and let the estimand be $\theta$. As a prior distribution for $\Theta$, we shall assume the
normal distribution $N(\mu, b^2)$. The joint density of $\Theta$ and $\mathbf{X} = (X_1, \ldots, X_n)$ is then
proportional to

(2.1)    $f(\mathbf{x}, \theta) = \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \theta)^2\right] \exp\left[-\frac{1}{2b^2}(\theta - \mu)^2\right]$.

To obtain the posterior distribution of $\Theta|\mathbf{x}$, the joint density is divided by the
marginal density of $\mathbf{X}$, so that the posterior distribution has the form $C(\mathbf{x}) f(\mathbf{x}, \theta)$.
If $C(\mathbf{x})$ is used generically to denote any function of $\mathbf{x}$ not involving $\theta$, the posterior
density of $\Theta|\mathbf{x}$ is

         $C(\mathbf{x}) \exp\left\{-\frac{1}{2}\theta^2\left[\frac{n}{\sigma^2} + \frac{1}{b^2}\right] + \theta\left[\frac{n\bar{x}}{\sigma^2} + \frac{\mu}{b^2}\right]\right\}$

         $= C(\mathbf{x}) \exp\left\{-\frac{1}{2}\left[\frac{n}{\sigma^2} + \frac{1}{b^2}\right]\left[\theta - \frac{n\bar{x}/\sigma^2 + \mu/b^2}{n/\sigma^2 + 1/b^2}\right]^2\right\}$.

This is recognized to be the normal density with mean

(2.2)    $E(\Theta|\mathbf{x}) = \frac{n\bar{x}/\sigma^2 + \mu/b^2}{n/\sigma^2 + 1/b^2}$

and variance

(2.3)    $\mathrm{var}(\Theta|\mathbf{x}) = \frac{1}{n/\sigma^2 + 1/b^2}$.

When the loss is squared error, the Bayes estimator of $\theta$ is given by (2.2) and
can be rewritten as

(2.4)    $\delta_\Lambda(\mathbf{x}) = \left(\frac{n/\sigma^2}{n/\sigma^2 + 1/b^2}\right)\bar{x} + \left(\frac{1/b^2}{n/\sigma^2 + 1/b^2}\right)\mu$,

and by Corollary 2.7.19, this result remains true for any loss function $\rho(d - \theta)$
for which $\rho$ is convex and even. This shows $\delta_\Lambda$ to be a weighted average of the
standard estimator $\bar{X}$ and the mean $\mu$ of the prior distribution, which is the Bayes
estimator before any observations are taken. As $n \to \infty$ with $\mu$ and b fixed, $\delta_\Lambda(\mathbf{X})$
becomes essentially the estimator $\bar{X}$, and $\delta_\Lambda(\mathbf{X}) \to \theta$ in probability. As $b \to 0$,
$\delta_\Lambda(\mathbf{X}) \to \mu$ in probability, as is to be expected when the prior becomes more
and more concentrated about $\mu$. As $b \to \infty$, $\delta_\Lambda(\mathbf{X})$ essentially coincides with $\bar{X}$,
which again is intuitively reasonable. These results are analogous to those in the
binomial case. See Problem 2.3.
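
The posterior mean (2.2), the posterior variance (2.3), and the limiting behavior in b can be examined with a small sketch. The function name `normal_posterior` and the numerical inputs are hypothetical choices made for illustration; only the formulas come from the example.

```python
def normal_posterior(xbar, n, sigma, mu, b):
    """Posterior mean (2.2) and variance (2.3) for a N(mu, b^2) prior
    and n iid N(theta, sigma^2) observations with sample mean xbar."""
    precision = n / sigma ** 2 + 1.0 / b ** 2
    mean = (n * xbar / sigma ** 2 + mu / b ** 2) / precision
    return mean, 1.0 / precision

if __name__ == "__main__":
    # Moderate prior: the estimate lies between mu = 0 and xbar = 2.
    print(normal_posterior(2.0, 4, 1.0, 0.0, 1.0))
    # Very diffuse prior (large b): the estimate is close to xbar.
    print(normal_posterior(2.0, 4, 1.0, 0.0, 1e6))
    # Very concentrated prior (small b): the estimate is close to mu.
    print(normal_posterior(2.0, 4, 1.0, 0.0, 1e-6))
```

With n = 4, σ = 1, μ = 0, and b = 1, the weights in (2.4) are 4/5 and 1/5, so the posterior mean is 1.6 and the posterior variance is 0.2.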


It was seen above that $\bar{X}$ is the limit of the Bayes estimators as $b \to \infty$. As
$b \to \infty$, the prior density tends to Lebesgue measure. Since the Fisher information
$I(\theta)$ of a location parameter is constant, this is actually the Jeffreys prior mentioned
under (iv) earlier in the section. It is easy to check that the posterior distribution
calculated from this improper prior is a proper distribution as soon as an observation
has been taken. This is not surprising; since X is normally distributed about $\theta$ with
variance 1, even a single observation provides a good idea of the position of $\theta$.

As in the binomial case, the question arises whether $\bar{X}$ is the Bayes solution
also with respect to a proper prior $\Lambda$. This question is answered for both cases by
the following theorem.

Theorem 2.3 Let $\Theta$ have a distribution $\Lambda$, and let $P_\theta$ denote the conditional
distribution of X given $\theta$. Consider the estimation of $g(\theta)$ when the loss function
is squared error. Then, no unbiased estimator $\delta(X)$ can be a Bayes solution unless

(2.5)    $E[\delta(X) - g(\Theta)]^2 = 0$,

where the expectation is taken with respect to variation in both X and $\Theta$.

Proof. Suppose $\delta(X)$ is a Bayes estimator and is unbiased for estimating $g(\theta)$.
Since $\delta(X)$ is Bayes and the loss is squared error,

         $\delta(X) = E[g(\Theta)|X]$,

with probability 1. Since $\delta(X)$ is unbiased,

         $E[\delta(X)|\theta] = g(\theta)$ for all $\theta$.

Conditioning on X and using (1.6.2) leads to

         $E[g(\Theta)\delta(X)] = E\{\delta(X) E[g(\Theta)|X]\} = E[\delta^2(X)]$.

Conditioning instead on $\Theta$, we find

         $E[g(\Theta)\delta(X)] = E\{g(\Theta) E[\delta(X)|\Theta]\} = E[g^2(\Theta)]$.

It follows that

         $E[\delta(X) - g(\Theta)]^2 = E[\delta^2(X)] + E[g^2(\Theta)] - 2E[\delta(X)g(\Theta)] = 0$,

as was to be proved. □

Let us now apply this result to the case that $\delta(X)$ is the sample mean.

Example 2.4 Sample means. If $X_i$, $i = 1, \ldots, n$, are iid with $E(X_i) = \theta$ and
var $X_i = \sigma^2$ (independent of $\theta$), then the risk of $\bar{X}$ (given $\theta$) is

$R(\theta, \bar{X}) = E(\bar{X} - \theta)^2 = \sigma^2/n.$

For any proper prior distribution on $\Theta$,

$E(\bar{X} - \Theta)^2 = \sigma^2/n \neq 0,$

so (2.5) cannot be satisfied and, from Theorem 2.3, $\bar{X}$ is not a Bayes estimator.

This argument will apply to any distribution for which the variance of $\bar{X}$ is
independent of $\theta$, such as the $N(\theta, \sigma^2)$ distribution in Example 2.2. However, if
the variance is a function of $\theta$, the situation is different.

If var $X_i = v(\theta)$, then (2.5) will hold only if

(2.6) $\int v(\theta)\, d\Lambda(\theta) = 0$

for some proper prior $\Lambda$. If $v(\theta) > 0$ (a.e. $\Lambda$), then (2.6) cannot hold. For example,
if $X_1, \ldots, X_n$ are iid Bernoulli($p$) random variables, then the risk function of the
sample mean $\delta(\Sigma X_i) = \Sigma X_i / n$ is

$E(\delta(\Sigma X_i) - p)^2 = \frac{p(1 - p)}{n},$

and the left side of (2.5) is therefore

$\frac{1}{n} \int_0^1 p(1 - p)\, d\Lambda(p).$

The integral is zero if and only if $\Lambda$ assigns probability 1 to the set $\{0, 1\}$. For such
a distribution $\Lambda$,

$\delta_\Lambda(0) = 0$ and $\delta_\Lambda(n) = 1,$

and any estimator satisfying this condition is a Bayes estimator for such a $\Lambda$.
Hence, in particular, $X/n$ is a Bayes estimator. Of course, if $\Lambda$ is true, then the
values $X = 1, 2, \ldots, n - 1$ are never observed. Thus, $X/n$ is Bayes only in a rather
trivial sense. ||
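
As a numerical illustration of the Bernoulli case in Example 2.4, the following Python sketch evaluates the left side of (2.5), $(1/n)\int p(1-p)\,d\Lambda(p)$, for two priors. The function names and the particular priors (a uniform beta(1, 1) prior and a two-point prior on $\{0, 1\}$ with hypothetical weights 0.3 and 0.7) are assumptions chosen only for illustration.

```python
def bayes_risk_sample_mean_beta(n, a, b):
    """(1/n) * E[p(1-p)] under a beta(a, b) prior, using the closed-form
    moments E[p] = a/(a+b) and E[p^2] = a(a+1)/((a+b)(a+b+1))."""
    mean = a / (a + b)
    second = a * (a + 1) / ((a + b) * (a + b + 1))
    return (mean - second) / n


def bayes_risk_sample_mean_discrete(n, support, weights):
    """The same quantity for a discrete prior given by support points and weights."""
    return sum(w * p * (1 - p) for p, w in zip(support, weights)) / n


# Proper continuous prior: the integral is strictly positive, so (2.5) fails.
uniform_prior = bayes_risk_sample_mean_beta(n=10, a=1.0, b=1.0)
# Prior concentrated on {0, 1}: the integral vanishes, so (2.5) holds.
two_point_prior = bayes_risk_sample_mean_discrete(10, [0.0, 1.0], [0.3, 0.7])
print(uniform_prior, two_point_prior)
```

Only the degenerate two-point prior makes the integral vanish, which is the trivial sense in which $X/n$ is Bayes.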

Extensions and discussion of other consequences of Theorem 2.3 can be found
in Bickel and Blackwell (1967), Noorbaloochi and Meeden (1983), and Bickel and
Mallows (1988). See Problem 2.4.

The beta and normal prior distributions in the binomial and normal cases are the
so-called conjugate families of prior distributions. These are frequently defined as
distributions with densities proportional to the density of $P_\theta$. It has been pointed
out by Diaconis and Ylvisaker (1979) that this definition is ambiguous; they show
that in the above examples and, more generally, in the case of exponential families,
conjugate priors can be characterized by the fact that the resulting Bayes estimators
are linear in $X$. They also extend the weighted-average representation (1.13) of the
Bayes estimator to general exponential families. For one-parameter exponential
families, MacEachern (1993) gives an alternate characterization of conjugate priors
based on the requirement that the posterior mean lies "in between" the prior mean
and sample mean.

As another example of the use of conjugate priors, consider the estimation of a 
normal variance. 


Example 2.5 Normal variance, known mean. Let $X_1, \ldots, X_n$ be iid according
to $N(0, \sigma^2)$, so that the joint density of the $X_i$'s is $C\tau^r e^{-\tau \Sigma x_i^2}$, where $\tau = 1/2\sigma^2$
and $r = n/2$. As conjugate prior for $\tau$, we take the gamma density $\Gamma(g, 1/\alpha)$ noting
that, by (1.5.44),

(2.7) $E(\tau) = \frac{g}{\alpha}, \quad E(\tau^2) = \frac{g(g + 1)}{\alpha^2}, \quad E\left(\frac{1}{\tau}\right) = \frac{\alpha}{g - 1}, \quad E\left(\frac{1}{\tau^2}\right) = \frac{\alpha^2}{(g - 1)(g - 2)}.$

Writing $y = \Sigma x_i^2$, we see that the posterior density of $\tau$ given the $x_i$'s is

$C(y)\, \tau^{r+g-1} e^{-\tau(\alpha + y)},$

which is $\Gamma[r + g, 1/(\alpha + y)]$. If the loss is squared error, the Bayes estimator of
$2\sigma^2 = 1/\tau$ is the posterior expectation of $1/\tau$, which by (2.7) is $(\alpha + y)/(r + g - 1)$.
The Bayes estimator of $\sigma^2 = 1/2\tau$ is therefore

(2.8) $\frac{\alpha + Y}{n + 2g - 2}.$

In the present situation, we might instead prefer to work with the scale invariant
loss function

(2.9) $\frac{(d - \sigma^2)^2}{\sigma^4},$

which leads to the Bayes estimator (Problem 2.6)

(2.10) $\frac{E(1/\sigma^2)}{E(1/\sigma^4)} = \frac{E(\tau)}{2E(\tau^2)},$

and hence by (2.7) after some simplification to

(2.11) $\frac{\alpha + Y}{n + 2g + 2}.$

Since the Fisher information for $\sigma$ is proportional to $1/\sigma^2$ (Table 2.5.1), the
Jeffreys prior density in the present case is proportional to the improper density
$1/\sigma$, which induces for $\tau$ the density $(1/\tau)\, d\tau$. This corresponds to the limiting
case $\alpha = 0$, $g = 0$, and hence by (2.8) and (2.11) to the Bayes estimators $Y/(n - 2)$
and $Y/(n + 2)$ for squared error and loss function (2.9), respectively. The first of
these has uniformly larger risk than the second, which is MRE. ||
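
The estimators (2.8) and (2.11) can be checked against posterior gamma moments. The following Python sketch computes $E(\tau^k)$ for the posterior $\Gamma[n/2 + g, 1/(\alpha + y)]$ through the log-gamma function and compares the resulting Bayes estimates of $\sigma^2 = 1/2\tau$ with the two closed forms. The function names and the numerical values of $y$, $n$, $g$, and $\alpha$ are hypothetical.

```python
import math


def gamma_moment(shape, rate, k):
    """E[tau^k] for tau ~ Gamma(shape, scale = 1/rate); requires shape + k > 0."""
    return math.exp(math.lgamma(shape + k) - math.lgamma(shape)) / rate ** k


def bayes_estimates_sigma2(y, n, g, alpha):
    """Bayes estimates of sigma^2 = 1/(2 tau) from the posterior
    Gamma(n/2 + g, scale = 1/(alpha + y)) of tau."""
    shape, rate = n / 2 + g, alpha + y
    # Squared error loss: posterior mean of 1/(2 tau).
    squared_error = 0.5 * gamma_moment(shape, rate, -1)
    # Loss (2.9): E(1/sigma^2) / E(1/sigma^4) = E(2 tau) / E(4 tau^2).
    scaled_loss = (2 * gamma_moment(shape, rate, 1)) / (4 * gamma_moment(shape, rate, 2))
    return squared_error, scaled_loss


y, n, g, alpha = 12.5, 10, 2.0, 3.0          # illustrative values only
se, sl = bayes_estimates_sigma2(y, n, g, alpha)
print(se, (alpha + y) / (n + 2 * g - 2))     # both sides of (2.8)
print(sl, (alpha + y) / (n + 2 * g + 2))     # both sides of (2.11)
```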

We next consider two examples involving more than one parameter. 

Example 2.6 Normal variance, unknown mean. Suppose that we let $X_1, \ldots, X_n$
be iid as $N(\theta, \sigma^2)$ and consider the Bayes estimation of $\theta$ and $\sigma^2$ when the prior
assigns to $\tau = 1/2\sigma^2$ the distribution $\Gamma(g, 1/\alpha)$ as in Example 2.5 and takes $\theta$ to
be independent of $\tau$ with (for the sake of simplicity) the uniform improper prior
$d\theta$ corresponding to $b = \infty$ in Example 2.2. Then, the joint posterior density of
$(\theta, \tau)$ is proportional to

(2.12) $\tau^{r+g-1} e^{-\tau[\alpha + z + n(\bar{x} - \theta)^2]}$

where $z = \Sigma(x_i - \bar{x})^2$ and $r = n/2$. By integrating out $\theta$, it is seen that the posterior
distribution of $\tau$ is $\Gamma[r + g - 1/2, 1/(\alpha + z)]$ (Problem 1.12). In particular, for
$\alpha = g = 0$, the Bayes estimator of $\sigma^2 = 1/2\tau$ is $Z/(n - 3)$ and $Z/(n + 1)$ for
squared error and loss function (2.9), respectively. To see that the Bayes estimator
of $\theta$ is $\bar{X}$ regardless of the values of $\alpha$ and $g$, it is enough to notice that the posterior
density of $\theta$ is symmetric about $\bar{X}$ (Problem 2.9; see also Problem 2.10). ||


A problem for which the theories of Chapters 2 and 3 do not lead to a satisfactory
solution is that of components of variance. The following example treats the
simplest case from the present point of view.

Example 2.7 Random effects one-way layout. In the model (3.5.1), suppose for
the sake of simplicity that $\mu$ and $Z_{11}$ have been eliminated either by invariance or
by assigning to $\mu$ the uniform prior on $(-\infty, \infty)$. In either case, this restricts the
problem to the remaining $Z$'s with joint density proportional to

(2.13) $\frac{1}{\sigma^{s(n-1)} (\sigma^2 + n\sigma_A^2)^{(s-1)/2}} \exp\left[ -\frac{1}{2(\sigma^2 + n\sigma_A^2)} \sum_{i=2}^{s} z_{i1}^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{s} \sum_{j=2}^{n} z_{ij}^2 \right].$

The most natural noninformative prior postulates $\sigma$ and $\sigma_A$ to be independent with
improper densities $1/\sigma$ and $1/\sigma_A$, respectively. Unfortunately, however, in this
case, the posterior distribution of $(\sigma, \sigma_A)$ continues to be improper, so that the
calculation of a posterior expectation is meaningless (Problem 2.12).

Instead, let us consider the Jeffreys prior $\Lambda$ which has the improper density
$(1/\sigma)(1/\tau)$ but with $\tau^2 = \sigma^2 + n\sigma_A^2$ so that the density is zero for $\tau < \sigma$. (For
a discussion of the appropriateness of this and related priors see Hill, Stone and
Springer 1965, Tiao and Tan 1965, Box and Tiao 1973, Hobert 1993, and Hobert
and Casella 1996.) The posterior distribution is then proper (Problem 2.11). The
resulting Bayes estimator $\delta_\Lambda$ of $\sigma_A^2$ is obtained by Klotz, Milton, and Zacks (1969),
who compare it with the more traditional estimators discussed in Example 5.5.
Since the risk of $\delta_\Lambda$ is quite unsatisfactory, Portnoy (1971) replaces squared error
by the scale invariant loss function $(d - \sigma_A^2)^2/(\sigma^2 + n\sigma_A^2)^2$, and shows the resulting
estimator to be

where $c = \frac{1}{2}(sn + 1)$, $a = \frac{1}{2}(s + 3)$, $R = S^2/(S_A^2 + S^2)$, and

$F(R) = \int_0^1 \frac{v^a}{[R + v(1 - R)]^{c+1}}\, dv.$

Portnoy's risk calculations suggest that $\delta'_\Lambda$ is a satisfactory estimator of $\sigma_A^2$ for
his loss function or equivalently for squared error loss. The estimation of $\sigma^2$ is
analogous. ||

Let us next examine the connection between Bayes estimation, sufficiency, and
the likelihood function. Recall that if $(X_1, X_2, \ldots, X_n)$ has density $f(x_1, \ldots, x_n \mid \theta)$,
the likelihood function is defined by $L(\theta \mid x) = L(\theta \mid x_1, \ldots, x_n) = f(x_1, \ldots, x_n \mid \theta)$.
If we observe $T = t$, where $T$ is sufficient for $\theta$, then

$f(x_1, \ldots, x_n \mid \theta) = L(\theta \mid x) = g(t \mid \theta) h(x),$

where the function $h(\cdot)$ does not depend on $\theta$. For any prior distribution $\pi(\theta)$, the
posterior distribution is then

$\pi(\theta \mid x) = \frac{f(x_1, \ldots, x_n \mid \theta)\pi(\theta)}{\int f(x_1, \ldots, x_n \mid \theta')\pi(\theta')\, d\theta'} = \frac{L(\theta \mid x)\pi(\theta)}{\int L(\theta' \mid x)\pi(\theta')\, d\theta'} = \frac{g(t \mid \theta)\pi(\theta)}{\int g(t \mid \theta')\pi(\theta')\, d\theta'},$

so $\pi(\theta \mid x) = \pi(\theta \mid t)$, that is, $\pi(\theta \mid x)$ depends on $x$ only through $t$, and the posterior
distribution of $\theta$ is the same whether we compute it on the basis of $x$ or of $t$. As an
illustration, in Example 2.2, rather than starting with (2.1), we could use the fact
that the sufficient statistic is $\bar{X} \sim N(\theta, \sigma^2/n)$ and, starting from

$f(\bar{x} \mid \theta) \propto e^{-n(\bar{x} - \theta)^2 / 2\sigma^2},$

arrive at the same posterior distribution for $\theta$ as before. Thus, Bayesian measures
that are computed from posterior distributions are functions of the data only through
the likelihood function and, hence, are functions of a minimal sufficient statistic.
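
The identity $\pi(\theta \mid x) = \pi(\theta \mid t)$ can be seen numerically. The sketch below uses a hypothetical Bernoulli sample, a uniform prior, and a finite grid of parameter values (all assumptions made for illustration), and normalizes the prior times the likelihood once with the full-data density and once with the binomial density of the sufficient statistic $T = \Sigma X_i$.

```python
import math


def grid_posterior(log_lik, prior, grid):
    """Normalize prior(theta) * exp(log_lik(theta)) over a finite grid."""
    weights = [prior(t) * math.exp(log_lik(t)) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]


data = [1, 0, 1, 1, 0, 1, 0, 1]              # hypothetical Bernoulli sample
n, t = len(data), sum(data)                  # T = sum of the X_i is sufficient
grid = [(i + 0.5) / 200 for i in range(200)]


def prior(p):
    """Uniform prior density on (0, 1)."""
    return 1.0


# Posterior from the full-data density f(x | p).
full = grid_posterior(
    lambda p: sum(math.log(p) if x else math.log(1 - p) for x in data), prior, grid)
# Posterior from the density g(t | p) of the sufficient statistic (binomial).
suff = grid_posterior(
    lambda p: math.log(math.comb(n, t)) + t * math.log(p) + (n - t) * math.log(1 - p),
    prior, grid)

max_gap = max(abs(a - b) for a, b in zip(full, suff))
print(max_gap)
```

The factor $h(x)$ (here the reciprocal of the binomial coefficient) cancels in the normalization, so the two posteriors agree up to rounding.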

Bayes estimators were defined in (1.1) with respect to a proper distribution $\Lambda$.
It is useful to extend this definition to the case that $\Lambda$ is a measure satisfying

(2.16) $\int d\Lambda(\theta) = \infty,$

a so-called improper prior. It may then still be the case that (1.3) is finite for each
$x$, so the Bayes estimator can formally be defined.

Example 2.8 Improper prior Bayes. For the situation of Example 1.5, where
$X \sim b(p, n)$, the Bayes estimator under a beta($a$, $b$) prior is given by (1.12). For
$a = b = 0$, this estimator is $x/n$, the sample mean, but the prior density is
$\pi(p) \propto p^{-1}(1 - p)^{-1}$, and hence is improper. The posterior
distribution in this case is

(2.17) $\frac{\binom{n}{x} p^{x-1}(1 - p)^{n-x-1}}{\int_0^1 \binom{n}{x} p^{x-1}(1 - p)^{n-x-1}\, dp} = \frac{\Gamma(n)}{\Gamma(x)\Gamma(n - x)}\, p^{x-1}(1 - p)^{n-x-1},$

which is a proper posterior distribution if $1 \le x \le n - 1$ with $x/n$ the posterior
mean. When $x = 0$ or $x = n$, the posterior density (2.17) is no longer proper.
However, for any estimator $\delta(x)$ that satisfies $\delta(0) = 0$ and $\delta(n) = 1$, the posterior
expected loss (1.3) is finite and minimized at $\delta(x) = x/n$ (see Problem 2.16 and
Example 2.4). Thus, even though the resulting posterior distribution is not proper
for all values of $x$, $\delta(x) = x/n$ can be considered a Bayes estimator. ||

This example suggests the following definition. 

Definition 2.9 An estimator $\delta^\pi(x)$ is a generalized Bayes estimator with respect
to a measure $\pi(\theta)$ (even if it is not a proper probability distribution) if the posterior
expected loss, $E\{L(\Theta, \delta(X)) \mid X = x\}$, is minimized at $\delta = \delta^\pi$ for all $x$.

As we will see, generalized Bayes estimators play an important part in point 
estimation optimality, since they often may be optimal under both Bayesian and 
frequentist criteria. 

There is one other useful variant of a Bayes estimator, a limit of Bayes estimators. 

Definition 2.10 A nonrandomized¹ estimator $\delta(x)$ is a limit of Bayes estimators
if there exists a sequence of proper priors $\pi_\nu$ and Bayes estimators $\delta^{\pi_\nu}$ such that
$\delta^{\pi_\nu}(x) \to \delta(x)$ a.e. [with respect to the density $f(x \mid \theta)$] as $\nu \to \infty$.

Example 2.11 Limit of Bayes estimators. In Example 2.8, it was seen that the
binomial estimator $X/n$ is Bayes with respect to an improper prior. We shall now
show that it is also a limit of Bayes estimators. This follows since

(2.18) $\lim_{a \to 0,\ b \to 0} \frac{a + x}{a + b + n} = \frac{x}{n}$

and the beta($a$, $b$) prior is proper if $a > 0$, $b > 0$. ||
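
The limit (2.18) is easy to trace numerically. In this Python sketch the function name and the values $x = 3$, $n = 10$ are illustrative assumptions; the priors beta($\epsilon$, $\epsilon$) are proper for every $\epsilon > 0$, and the corresponding Bayes estimates approach $x/n$ as $\epsilon$ decreases.

```python
def beta_bayes_estimate(x, n, a, b):
    """Posterior mean (a + x) / (a + b + n) under a proper beta(a, b) prior."""
    if a <= 0 or b <= 0:
        raise ValueError("beta(a, b) is a proper prior only for a > 0, b > 0")
    return (a + x) / (a + b + n)


x, n = 3, 10                                  # illustrative data
epsilons = (1.0, 0.1, 0.01, 0.001)
for eps in epsilons:
    print(eps, beta_bayes_estimate(x, n, eps, eps))

# Distance from the limiting estimator x / n along the sequence of priors.
gaps = [abs(beta_bayes_estimate(x, n, e, e) - x / n) for e in epsilons]
```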


From a Bayesian view, estimators that are limits of Bayes estimators are somewhat
more desirable than generalized Bayes estimators. This is because, by construction,
a limit of Bayes estimators must be close to a proper Bayes estimator.
In contrast, a generalized Bayes estimator may not be close to any proper Bayes
estimator (see Problem 2.15).


3 Single-Prior Bayes 

As discussed at the end of Section 1, the prior distribution is typically selected from
a flexible family of prior densities indexed by one or more parameters. Instead
of denoting the prior by $\Lambda$, as was done in Section 1, we shall now denote its
density by $\pi(\theta \mid \gamma)$, where the parameter $\gamma$ can be real- or vector-valued. (Hence,
we are implicitly assuming that the prior $\pi$ is absolutely continuous with respect
to a dominating measure $\mu(\theta)$, which, unless specified, is taken to be Lebesgue
measure.)

1 For randomized estimators the convergence can only be in distribution. See Ferguson 1967 (Section
1.8) or Brown 1986a (Appendix).

We can then write a Bayes model in a general form as

(3.1) $X \mid \theta \sim f(x \mid \theta),$
      $\Theta \mid \gamma \sim \pi(\theta \mid \gamma).$

Thus, conditionally on $\theta$, $X$ has sampling density $f(x \mid \theta)$, and conditionally on $\gamma$, $\Theta$
has prior density $\pi(\theta \mid \gamma)$. From this model, we calculate the posterior distribution,
$\pi(\theta \mid x, \gamma)$, from which all Bayesian answers would come. The exact manner in
which we deal with the parameter $\gamma$ or, more generally, the prior distribution
$\pi(\theta \mid \gamma)$ will lead us to different types of Bayes analyses. In this section we assume
that the functional form of the prior, and the value of $\gamma$, is known so we have
one completely specified prior. (To emphasize that point, we will sometimes write
$\gamma = \gamma_0$.)

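Given a loss function $L(\theta, d)$, we then look for the estimator that minimizes

(3.2) $\int L(\theta, d(x))\, \pi(\theta \mid x, \gamma_0)\, d\theta,$

where $\pi(\theta \mid x, \gamma_0) = f(x \mid \theta)\pi(\theta \mid \gamma_0) / \int f(x \mid \theta)\pi(\theta \mid \gamma_0)\, d\theta$.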

The calculation of single-prior Bayes estimators has already been illustrated in 
Section 2. Here is another example. 

Example 3.1 Scale uniform. For estimation in the model

(3.3) $X_i \mid \theta \sim U(0, \theta), \quad i = 1, \ldots, n,$
      $\frac{1}{\theta} \mid a, b \sim \text{Gamma}(a, b), \quad a, b$ known,

sufficiency allows us to work only with the density of $Y = \max_i X_i$, which is
given by $g(y \mid \theta) = n y^{n-1}/\theta^n$, $0 < y < \theta$. We then calculate the single-prior Bayes
estimator of $\theta$ under squared error loss. By (4.1.4), this is the posterior mean, given
by

(3.4) $E(\Theta \mid y, a, b) = \frac{\int_y^\infty \theta\, \frac{1}{\theta^{n+a+1}}\, e^{-1/\theta b}\, d\theta}{\int_y^\infty \frac{1}{\theta^{n+a+1}}\, e^{-1/\theta b}\, d\theta}.$

Although the ratio of integrals is not expressible in any simple form, calculation
is not difficult. See Problem 3.1 for details. ||
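
One way to carry out the calculation of (3.4) is numerical quadrature. The following Python sketch is only one possible approach and rests on two assumptions: Gamma($a$, $b$) is read as shape $a$ and scale $b$, and $n + a \ge 2$. After the substitution $u = 1/\theta$, both integrals run over $0 < u < 1/y$ and are approximated by the midpoint rule; the function name and the number of steps are illustrative.

```python
import math


def scale_uniform_posterior_mean(y, n, a, b, steps=20000):
    """Posterior mean (3.4) of theta via the substitution u = 1/theta:
        numerator   = integral over (0, 1/y) of u^(n+a-2) exp(-u/b) du
        denominator = integral over (0, 1/y) of u^(n+a-1) exp(-u/b) du
    Both are approximated by the midpoint rule (assumes n + a >= 2)."""
    upper = 1.0 / y
    h = upper / steps
    num = den = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        w = math.exp(-u / b)
        num += u ** (n + a - 2) * w
        den += u ** (n + a - 1) * w
    return num / den            # the common factor h cancels in the ratio


estimate = scale_uniform_posterior_mean(y=2.0, n=5, a=2.0, b=1.0)
# For very large b the factor exp(-u/b) is nearly 1 and the ratio
# approaches y (n + a) / (n + a - 1).
flat_limit = scale_uniform_posterior_mean(y=2.0, n=5, a=2.0, b=1e9)
print(estimate, flat_limit)
```

Since the posterior is supported on $\theta > y$, the estimate always exceeds the observed maximum $y$.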


In general, the Bayes estimator under squared error loss is given by

(3.5) $E(\Theta \mid x) = \frac{\int \theta f(x \mid \theta)\pi(\theta)\, d\theta}{\int f(x \mid \theta)\pi(\theta)\, d\theta},$

where $X \sim f(x \mid \theta)$ is the observed random variable and $\Theta \sim \pi(\theta)$ is the parameter
of interest. While there is a certain appeal about expression (3.5), it can be difficult
to work with. It is therefore important to find conditions under which it can be
simplified. Such simplification is useful for two somewhat related purposes.

(i) Implementation

If a Bayes solution is deemed appropriate, and we want to implement it, we
must be able to calculate (3.5). Thus, we need reasonably straightforward,
and general, methods of evaluating these integrals.

(ii) Performance

By construction, a Bayes estimator minimizes the posterior expected loss and,
hence, the Bayes risk. Often, however, we are interested in its performance,
and perhaps optimality under other measures. For example, we might examine
its mean squared error (or, more generally, its risk function) in looking for
admissible or minimax estimators. We also might examine Bayesian measures
using other priors, in an investigation of Bayesian robustness.

These latter considerations tend to lead us to look for either manageable expressions
for or accurate approximations to the integrals in (3.5). On the other hand,
the considerations in (i) are more numerical (or computational) in nature, leading
us to algorithms that ease the computational burden. However, even this path can
involve statistical considerations, and often gives us insight into the performance
of our estimators.

A simplification of (3.5) is possible when dealing with independent prior distributions.
If $X_i \sim f(x \mid \theta_i)$, $i = 1, \ldots, n$, are independent, and the prior is
$\pi(\theta_1, \ldots, \theta_n) = \prod_i \pi(\theta_i)$, then the posterior mean of $\theta_i$ satisfies

(3.6) $E(\theta_i \mid x_1, \ldots, x_n) = E(\theta_i \mid x_i),$

that is, the Bayes estimator of $\theta_i$ only depends on the data through $x_i$. Although
the simplification provided by (3.6) may prove useful, at this level of generality it
is impossible to go further.

However, for exponential families, evaluation of (3.5) is sometimes possible
through alternate representations of Bayes estimators. Suppose the distribution of
$X = (X_1, \ldots, X_n)$ is given by the multiparameter exponential family (see (1.5.2)),
that is,

(3.7) $p_\eta(x) = \exp\left\{ \sum_{i=1}^{s} \eta_i T_i(x) - A(\eta) \right\} h(x).$

Then, we can express the Bayes estimator as a function of partial derivatives with
respect to $x$. The following theorem presents a general formula for the needed
posterior expectation.


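Theorem 3.2 If $X$ has density (3.7), and $\eta$ has prior density $\pi(\eta)$, then for $j =
1, \ldots, n$,

(3.8) $E\left( \sum_{i=1}^{s} \eta_i \frac{\partial T_i(x)}{\partial x_j} \,\Big|\, x \right) = \frac{\partial}{\partial x_j} \log m(x) - \frac{\partial}{\partial x_j} \log h(x),$

where $m(x) = \int p_\eta(x)\pi(\eta)\, d\eta$ is the marginal distribution of $X$. Alternatively,
the posterior expectation can be expressed in matrix form as

(3.9) $E(T\eta) = \nabla \log m(x) - \nabla \log h(x),$

where $T = \{\partial T_i / \partial x_j\}$.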

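Proof. Noting that $\partial \exp\{\sum \eta_i T_i\} / \partial x_j = \sum_i \eta_i (\partial T_i / \partial x_j) \exp\{\sum \eta_i T_i\}$, we can
write

$E\left( \sum \eta_i \frac{\partial T_i(x)}{\partial x_j} \,\Big|\, x \right) = \frac{1}{m(x)} \int \left( \sum \eta_i \frac{\partial T_i}{\partial x_j} \right) e^{\sum \eta_i T_i - A(\eta)} h(x)\pi(\eta)\, d\eta$

$= \frac{1}{m(x)} \int \left[ \frac{\partial}{\partial x_j} e^{\sum \eta_i T_i} \right] e^{-A(\eta)} h(x)\pi(\eta)\, d\eta$

$= \frac{1}{m(x)} \int \left[ \frac{\partial}{\partial x_j} \left( e^{\sum \eta_i T_i} h(x) \right) - e^{\sum \eta_i T_i} \frac{\partial}{\partial x_j} h(x) \right] e^{-A(\eta)} \pi(\eta)\, d\eta$

$= \frac{1}{m(x)} \frac{\partial}{\partial x_j} \int e^{\sum \eta_i T_i - A(\eta)} h(x)\pi(\eta)\, d\eta - \frac{\frac{\partial}{\partial x_j} h(x)}{h(x)} \frac{1}{m(x)} \int e^{\sum \eta_i T_i - A(\eta)} h(x)\pi(\eta)\, d\eta$

$= \frac{\frac{\partial}{\partial x_j} m(x)}{m(x)} - \frac{\frac{\partial}{\partial x_j} h(x)}{h(x)},$

where, in the third equality, we have used the fact that

$\left[ \frac{\partial}{\partial x_j} e^{\sum \eta_i T_i} \right] h(x) = \frac{\partial}{\partial x_j} \left( e^{\sum \eta_i T_i} h(x) \right) - e^{\sum \eta_i T_i} \frac{\partial}{\partial x_j} h(x).$

In the fourth equality, we have interchanged the order of integration and differentiation
(justified by Theorem 1.5.8), and used the definition of $m(x)$. Finally, using
logarithms, $E\left( \sum \eta_i \frac{\partial T_i(x)}{\partial x_j} \,\big|\, x \right)$ can be written as (3.8). □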

Although it may appear that this theorem merely shifts calculation from one
integral [the posterior of (3.5)] to another [the marginal $m(x)$ of (3.8)], this shift
brings advantages which will be seen throughout the remainder of this section
(and beyond). These advantages stem from the facts that the calculation of the
derivatives of $\log m(x)$ is often feasible and that, with the estimator expressed as
(3.8), risk calculations may be simplified. Theorem 3.2 simplifies further when
$T_i = X_i$.

Corollary 3.3 If $X = (X_1, \ldots, X_p)$ has the density

(3.11) $p_\eta(x) = e^{\sum_{i=1}^{p} \eta_i x_i - A(\eta)} h(x)$

and $\eta$ has prior density $\pi(\eta)$, the Bayes estimator of $\eta$ under the loss $L(\eta, \delta) =
\sum (\eta_i - \delta_i)^2$ is given by

(3.12) $E(\eta_i \mid x) = \frac{\partial}{\partial x_i} \log m(x) - \frac{\partial}{\partial x_i} \log h(x).$

Proof. Problem 3.3.

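Example 3.4 Multiple normal model. For

$X_i \mid \theta_i \sim N(\theta_i, \sigma^2), \quad i = 1, \ldots, p,$ independent,

$\Theta_i \sim N(\mu, \tau^2), \quad i = 1, \ldots, p,$ independent,

where $\sigma^2$, $\tau^2$, and $\mu$ are known, $\eta_i = \theta_i/\sigma^2$ and the Bayes estimator of $\theta_i$ is

$E(\Theta_i \mid x) = \sigma^2 E(\eta_i \mid x) = \sigma^2 \left[ \frac{\partial}{\partial x_i} \log m(x) - \frac{\partial}{\partial x_i} \log h(x) \right] = \frac{\tau^2}{\sigma^2 + \tau^2}\, x_i + \frac{\sigma^2}{\sigma^2 + \tau^2}\, \mu,$

since

$\frac{\partial}{\partial x_i} \log m(x) = \frac{\partial}{\partial x_i} \log \left( e^{-\frac{1}{2(\sigma^2 + \tau^2)} \sum (x_i - \mu)^2} \right) = -\frac{x_i - \mu}{\sigma^2 + \tau^2}$

and

$\frac{\partial}{\partial x_i} \log h(x) = \frac{\partial}{\partial x_i} \log \left( e^{-\frac{1}{2} \sum x_i^2 / \sigma^2} \right) = -\frac{x_i}{\sigma^2}.$ ||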
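
The representation (3.12) can be checked numerically in the normal model of Example 3.4. In the sketch below, the logarithms of $m(x)$ and $h(x)$ are written up to additive constants for a single coordinate, their derivatives are approximated by central finite differences, and the result is compared with the weighted average of $x_i$ and $\mu$. The function names, the step size, and the parameter values are assumptions for illustration.

```python
def log_m(x, mu, sigma2, tau2):
    """log marginal density of one X_i ~ N(mu, sigma2 + tau2), up to a constant."""
    return -(x - mu) ** 2 / (2 * (sigma2 + tau2))


def log_h(x, sigma2):
    """log h(x) for the N(theta, sigma2) family in natural form, up to a constant."""
    return -x ** 2 / (2 * sigma2)


def derivative(f, x, eps=1e-5):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def bayes_theta(x, mu, sigma2, tau2):
    """sigma^2 * (d/dx log m(x) - d/dx log h(x)), following (3.12)."""
    d_log_m = derivative(lambda u: log_m(u, mu, sigma2, tau2), x)
    d_log_h = derivative(lambda u: log_h(u, sigma2), x)
    return sigma2 * (d_log_m - d_log_h)


x, mu, sigma2, tau2 = 1.7, 0.5, 2.0, 3.0     # illustrative values
closed_form = tau2 / (sigma2 + tau2) * x + sigma2 / (sigma2 + tau2) * mu
numeric = bayes_theta(x, mu, sigma2, tau2)
print(numeric, closed_form)
```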


An application of the representation (3.12) is to the comparison of the risk of the 
Bayes estimator with the risk of the best unbiased estimator. 

Theorem 3.5 Under the assumptions of Corollary 3.3, the risk of the Bayes estimator
(3.12), under the sum of squared error loss, is

(3.13) $R[\eta, E(\eta \mid X)] = R[\eta, -\nabla \log h(X)] + \sum_{i=1}^{p} E\left\{ 2 \frac{\partial^2}{\partial X_i^2} \log m(X) + \left( \frac{\partial}{\partial X_i} \log m(X) \right)^2 \right\}.$


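Proof. By an application of Stein's identity (Lemma 1.5.15; see Problem 3.4), it
is straightforward to establish that for the situation of Corollary 3.3,

$E_\eta \left[ -\frac{\partial}{\partial X_i} \log h(X) \right] = -\int \left[ \frac{\partial}{\partial x_i} \log h(x) \right] p_\eta(x)\, dx = \eta_i.$

Hence, if we write $\nabla \log h(x) = \{\partial / \partial x_i \log h(x)\}$,

(3.14) $E_\eta[-\nabla \log h(X)] = \eta.$

Thus, $-\nabla \log h(X)$ is an unbiased estimator of $\eta$ with risk

(3.15) $R[\eta, -\nabla \log h(X)] = E_\eta \sum_{i=1}^{p} \left[ \eta_i + \frac{\partial}{\partial X_i} \log h(X) \right]^2 = E_\eta |\eta + \nabla \log h(X)|^2,$

which can also be further evaluated using Stein's identity (see Problem 3.4). Returning
to (3.12), the risk of the Bayes estimator is given by

$R[\eta, E(\eta \mid X)] = \sum_{i=1}^{p} E[\eta_i - E(\eta_i \mid X)]^2$

$= \sum_{i=1}^{p} E\left[ \eta_i - \left( \frac{\partial}{\partial X_i} \log m(X) - \frac{\partial}{\partial X_i} \log h(X) \right) \right]^2$

(3.16) $= R[\eta, -\nabla \log h(X)] - 2 \sum_{i=1}^{p} E\left[ \left( \eta_i + \frac{\partial}{\partial X_i} \log h(X) \right) \frac{\partial}{\partial X_i} \log m(X) \right] + \sum_{i=1}^{p} E\left[ \frac{\partial}{\partial X_i} \log m(X) \right]^2.$

An application of Stein's identity to the middle term leads to

$E_\eta \left[ \left( \eta_i + \frac{\partial}{\partial X_i} \log h(X) \right) \frac{\partial}{\partial X_i} \log m(X) \right] = -E_\eta \left[ \frac{\partial^2}{\partial X_i^2} \log m(X) \right],$

which establishes (3.13). □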


From (3.13), we see that if the second term is negative, then the Bayes estimator
of $\eta$ will have smaller risk than the unbiased estimator $-\nabla \log h(X)$ (which is
best unbiased if the family is complete). We will exploit this representation (3.13)
in Chapter 5, but now just give a simple example.


Example 3.6 Continuation of Example 3.4. To evaluate the risk of the Bayes
estimator, we also calculate

$\frac{\partial^2}{\partial x_i^2} \log m(x) = -\frac{1}{\sigma^2 + \tau^2}$

and hence, from (3.13),

(3.17) $R[\eta, E(\eta \mid X)] = R[\eta, -\nabla \log h(X)] - \frac{2p}{\sigma^2 + \tau^2} + \sum_{i=1}^{p} E_\eta \left( \frac{X_i - \mu}{\sigma^2 + \tau^2} \right)^2.$

The best unbiased estimator of $\eta_i = \theta_i/\sigma^2$ is

$-\frac{\partial}{\partial X_i} \log h(X) = \frac{X_i}{\sigma^2}$

with risk $R(\eta, -\nabla \log h(X)) = p/\sigma^2$. If $\eta_i = \mu$ for each $i$, then the Bayes estimator
has smaller risk, whereas the Bayes estimator has infinite risk as $|\eta_i - \mu| \to \infty$
for any $i$ (Problem 3.6). ||
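
The risk formula (3.17) can be compared with a direct variance-plus-squared-bias calculation. The Python sketch below is parameterized by the means $\theta_i$ (so that $\eta_i = \theta_i/\sigma^2$) and uses $E(X_i - \mu)^2 = \sigma^2 + (\theta_i - \mu)^2$; the function names and the numerical values are hypothetical.

```python
def risk_unbiased(p, sigma2):
    """Risk of X_i / sigma^2 as an estimator of eta_i = theta_i / sigma^2, summed over i."""
    return p / sigma2


def risk_bayes_via_3_17(theta, mu, sigma2, tau2):
    """Right side of (3.17), using E(X_i - mu)^2 = sigma2 + (theta_i - mu)^2."""
    s = sigma2 + tau2
    p = len(theta)
    third = sum((sigma2 + (t - mu) ** 2) / s ** 2 for t in theta)
    return risk_unbiased(p, sigma2) - 2 * p / s + third


def risk_bayes_direct(theta, mu, sigma2, tau2):
    """Variance plus squared bias of the Bayes estimator of eta_i,
    (tau2 * X_i + sigma2 * mu) / (sigma2 * (sigma2 + tau2))."""
    s = sigma2 + tau2
    variance = tau2 ** 2 / (sigma2 * s ** 2)
    return sum(variance + ((t - mu) / s) ** 2 for t in theta)


mu, sigma2, tau2 = 0.5, 2.0, 3.0
near = [0.5, 0.5, 0.5]                       # every theta_i equal to the prior mean
far = [0.5, 1.0, 40.0]                       # one theta_i far from the prior mean
print(risk_bayes_via_3_17(near, mu, sigma2, tau2), risk_unbiased(3, sigma2))
print(risk_bayes_via_3_17(far, mu, sigma2, tau2), risk_unbiased(3, sigma2))
```

The Bayes estimator improves on the unbiased one near the prior mean, and its risk grows without bound as a mean moves away from it.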


We close this section by noting that in exponential families there is a general
expression for the conjugate prior distribution and that use of this conjugate prior
results in a simple expression for the posterior mean. For the density

(3.18) $p_\eta(x) = e^{\eta x - A(\eta)} h(x), \quad -\infty < x < \infty,$

the conjugate prior family is

(3.19) $\pi(\eta \mid k, \mu) = c(k, \mu)\, e^{k\eta\mu - kA(\eta)},$

where $\mu$ can be thought of as a prior mean and $k$ is proportional to a prior variance
(see Problem 3.9).

If $X_1, \ldots, X_n$ is a sample from $p_\eta(x)$ of (3.18), the posterior distribution resulting
from (3.19) is

(3.20) $\pi(\eta \mid x, k, \mu) \propto \left[ e^{n\eta\bar{x} - nA(\eta)} \right] \left[ e^{k\eta\mu - kA(\eta)} \right] = e^{\eta(n\bar{x} + k\mu) - (n + k)A(\eta)},$

which is in the same form as (3.19) with $\mu' = (n\bar{x} + k\mu)/(n + k)$ and $k' = n + k$.
Thus, using Problem 3.9,

(3.21) $E(A'(\eta) \mid x, k, \mu) = \frac{n\bar{x} + k\mu}{n + k}.$

As $E(X \mid \eta) = A'(\eta)$, we see that the posterior mean is a convex combination of the
sample and prior means.
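
As a sketch of the update (3.20) and the posterior mean (3.21), consider the $N(\eta, 1)$ family, for which $A(\eta) = \eta^2/2$ and $A'(\eta) = \eta$. This choice of family, the function names, and the grid settings below are assumptions made for illustration. The code normalizes the kernel in (3.20) on a grid and compares the resulting mean of $\eta$ with the convex combination $(n\bar{x} + k\mu)/(n + k)$.

```python
import math


def conjugate_update(xbar, n, mu, k):
    """Posterior hyperparameters (mu', k') read off from (3.20)."""
    return (n * xbar + k * mu) / (n + k), n + k


def posterior_mean_normal_grid(xbar, n, mu, k, half_width=10.0, steps=20001):
    """Numerical E[eta | x] for the N(eta, 1) family, obtained by normalizing
    exp{eta (n xbar + k mu) - (n + k) eta^2 / 2} on an equally spaced grid."""
    center = (n * xbar + k * mu) / (n + k)
    h = 2 * half_width / (steps - 1)
    num = den = 0.0
    for i in range(steps):
        eta = center - half_width + i * h
        # Exponent of the kernel minus its value at the mode, to avoid overflow.
        expo = (n + k) * (eta * center - eta ** 2 / 2) - (n + k) * center ** 2 / 2
        w = math.exp(expo)
        num += eta * w
        den += w
    return num / den


xbar, n, mu, k = 1.3, 8, -0.4, 2.0           # illustrative values
mu_post, k_post = conjugate_update(xbar, n, mu, k)
grid_mean = posterior_mean_normal_grid(xbar, n, mu, k)
print(mu_post, k_post, grid_mean)
```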

Example 3.7 Conjugate gamma. Let $X_1, \ldots, X_n$ be iid as Gamma($a$, $b$), where
$a$ is known. This is in the exponential family form with $\eta = -1/b$ and $A(\eta) =
-a \log(-\eta)$. If we use a conjugate prior distribution (3.19) for $b$, then

$E(A'(\eta) \mid x) = E\left( -\frac{a}{\eta} \,\Big|\, x \right) = \frac{n\bar{x} + k\mu}{n + k}.$

The resulting Bayes estimator under squared error loss is

(3.22) $E(b \mid x) = \frac{1}{a} \left( \frac{n\bar{x} + k\mu}{n + k} \right).$

This is the Bayes estimator based on an inverted gamma prior for $b$ (see Problem
3.10). ||


Using the conjugate prior (3.19) will not generally lead to simplifications in
(3.9) and is, therefore, not helpful in obtaining expressions for estimators of the
natural parameter. However, there is often more interest in estimating the mean
parameter rather than the natural parameter.


4 Equivariant Bayes 

Definition 3.2.4 specified what is meant by an estimation problem being invariant
under a transformation $g$ of the sample space and the induced transformations $\bar{g}$
and $g^*$ of the parameter and decision spaces, respectively. In such a situation, when
considering Bayes estimation, it is natural to select a prior distribution which is
also invariant.

Recall that a group family is a family of distributions which is invariant under
a group $G$ of transformations for which $\bar{G}$ is transitive over the parameter space.
We shall say that a prior distribution $\Lambda$ for $\theta$ is invariant with respect to $\bar{G}$ if the
distribution of $\bar{g}\theta$ is also $\Lambda$ for all $\bar{g} \in \bar{G}$; that is, if for all $\bar{g} \in \bar{G}$ and all measurable
$B$

(4.1) $P_\Lambda(\bar{g}\theta \in B) = P_\Lambda(\theta \in B)$

or, equivalently,

(4.2) $\Lambda(\bar{g}^{-1} B) = \Lambda(B).$

Suppose now that such a $\Lambda$ exists and that the Bayes solution $\delta_\Lambda$ with respect to it
is unique. By (4.1), any $\delta$ then satisfies

(4.3) $\int R(\theta, \delta)\, d\Lambda(\theta) = \int R(\bar{g}\theta, \delta)\, d\Lambda(\theta).$

Now

$R(\bar{g}\theta, \delta) = E_{\bar{g}\theta}\{L[\bar{g}\theta, \delta(X)]\} = E_\theta\{L[\bar{g}\theta, \delta(gX)]\}$

(4.4) $= E_\theta\{L[\theta, g^{*-1}\delta(gX)]\}.$

Here, the second equality follows from (3.2.4) (invariance of the model) and the
third from (3.2.14) (invariance of the loss). On substituting this last expression into
the right side of (4.3), we see that if $\delta_\Lambda(x)$ minimizes (4.3), so does the estimator
$g^{*-1}\delta_\Lambda(gx)$. Hence, if the Bayes estimator is unique, the two must coincide. By
(3.2.17), this appears to prove $\delta_\Lambda$ to be equivariant. However, at this point, a
technical difficulty arises. Uniqueness can be asserted only up to null sets, that
is, sets $N$ with $P_\theta(N) = 0$ for all $\theta$. Moreover, the set $N$ may depend on $g$. An
estimator $\delta$ satisfying

(4.5) $\delta(x) = g^{*-1}\delta(gx)$ for all $x \notin N_g$

where $P_\theta(N_g) = 0$ for all $\theta$ is said to be almost equivariant. We have therefore
proved the following result.

Theorem 4.1 Suppose that an estimation problem is invariant under a group and
that there exists a distribution $\Lambda$ over $\Omega$ such that (4.2) holds for all (measurable)
subsets $B$ of $\Omega$ and all $g \in G$. Then, if the Bayes estimator $\delta_\Lambda$ is unique, it is
almost equivariant.

Example 4.2 Equivariant binomial. Suppose we are interested in estimating
$p$ under squared error loss, where $X \sim$ binomial($n$, $p$). A common group of
transformations which leaves the problem invariant is

$gX = n - X,$
$\bar{g}p = 1 - p.$

For a prior $\Lambda$ to satisfy (4.1), we must have

(4.6) $P_\Lambda(\bar{g}p \le t) = P_\Lambda(p \le t)$ for all $t$.

If $\Lambda$ has density $\gamma(p)$, then (4.1) implies

(4.7) $\int_0^t \gamma(p)\, dp = \int_0^t \gamma(1 - p)\, dp$ for all $t$,

which, upon differentiating, requires $\gamma(t) = \gamma(1 - t)$ for all $t$ and, hence, that $\gamma(t)$
must be symmetric about $t = 1/2$. It then follows that, for example, a Bayes rule
under a symmetric beta prior is equivariant. See Problem 4.1. ||
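
For the binomial problem of Example 4.2 with squared error loss, where $g^* d = 1 - d$, equivariance of an estimator $\delta$ means $\delta(n - x) = 1 - \delta(x)$. The following Python sketch checks this for the posterior mean $(a + x)/(a + b + n)$ under a beta($a$, $b$) prior; the function names and parameter values are illustrative assumptions.

```python
def beta_bayes_rule(x, n, a, b):
    """Posterior mean of p under a beta(a, b) prior and binomial(n, p) data."""
    return (a + x) / (a + b + n)


def is_equivariant(n, a, b, tol=1e-12):
    """Check delta(n - x) = 1 - delta(x) for every x = 0, ..., n."""
    return all(
        abs(beta_bayes_rule(n - x, n, a, b) - (1 - beta_bayes_rule(x, n, a, b))) < tol
        for x in range(n + 1)
    )


symmetric = is_equivariant(n=12, a=2.0, b=2.0)    # prior density symmetric about 1/2
asymmetric = is_equivariant(n=12, a=2.0, b=5.0)   # prior density not symmetric
print(symmetric, asymmetric)
```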


The existence of a proper invariant prior distribution is rather special. More
often, the invariant measure for $\theta$ will be improper (if it exists at all), and the
situation is then more complicated. In particular, (i) the integral (4.3) may not be
finite, and the argument leading to Theorem 4.1 is thus no longer valid and (ii) it
becomes necessary to distinguish between left- and right-invariant measures $\Lambda$.
These complications require a level of group-theoretic sophistication that we do
not assume. However, for the case of location-scale, we can develop the theory in
sufficient detail. [For a more comprehensive treatment of invariant Haar measures,
see Berger 1985 (Section 6.6), Robert 1994a (Section 7.4), or Schervish 1995
(Section 6.2). A general development of the theory of group invariance, and its
application to statistics, is given by Eaton (1989) and Wijsman (1990).]

To discuss invariant prior distributions, or more generally invariant measures,
over the parameter space, we begin by considering invariant measures over groups.
(See Section 1.4 for some of the basics.) Let $G$ be a group and $\mathcal{L}$ be a $\sigma$-field of
measurable subsets of $G$, and for a set $B$ in $\mathcal{L}$, let

$Bh = \{gh : g \in B\}$

and

$gB = \{gh : h \in B\}.$

Then, a measure $\Lambda$ over $(G, \mathcal{L})$ is a right-invariant Haar measure
if

(4.8) $\Lambda(Bh) = \Lambda(B)$ for all $B \in \mathcal{L}$, $h \in G$

and a left-invariant Haar measure if

(4.9) $\Lambda(gB) = \Lambda(B)$ for all $B \in \mathcal{L}$, $g \in G$.

In our examples, measures satisfying (4.8) or (4.9) exist, have densities, and
are unique up to multiplication by a positive constant. We will now look at some
location-scale examples.

Example 4.3 Location group. For $x = (x_1, \ldots, x_n)$ in a Euclidean sample space,
consider the transformations

(4.10) $gx = (x_1 + g, \ldots, x_n + g), \quad -\infty < g < \infty,$

with the composition (group operation)

(4.11) $g \circ h = g + h,$

which was already discussed in Sections 1 and 2. Here, $G$ is the set of real numbers
$g$, and for $\mathcal{L}$, we can take the Borel sets. The sets $Bh$ and $gB$ are

$Bh = \{g + h : g \in B\}$ and $gB = \{g + h : h \in B\}$

and satisfy $Bg = gB$ since

(4.12) $g \circ h = h \circ g.$

When (4.12) holds, the group operation is said to be commutative; groups with this
property are called Abelian. For an Abelian group, if a measure is right invariant, it
is also left invariant and vice versa, and will then be called invariant. In the present
case, Lebesgue measure is invariant since it assigns the same measure to a set $B$
on the line as to the set obtained by translating $B$ by any fixed amount $g$ or $h$ to the
right or left. (Abelian groups are a special case of unimodular groups, the type of
group for which the left- and right-invariant measures agree. See Wijsman (1990,
Chapter 7) for details.) ||

There is a difference between transformations acting on parameters or on elements
of a group. In the first case, we know what we mean by $\bar{g}\theta$, but $\theta\bar{g}$ makes
no sense. On the other hand, for group elements, multiplication is possible both
on the left and on the right.

For this reason, a prior distribution $\Lambda$ (a measure over the parameter space) satisfying
(4.1), which only requires left-invariance, is said to be invariant. However,
for measures over a group one must distinguish between left- and right-invariance
and call such measures invariant only if they are both left and right invariant.

Example 4.4 Scale group. For $x = (x_1, \ldots, x_n)$ consider the transformations

$gx = (gx_1, \ldots, gx_n), \quad 0 < g < \infty,$

that is, multiplication of each coordinate by the same positive number $g$, with the
composition

$g \circ h = g \times h.$

The sets $Bh$ and $gB$ are obtained by multiplying each element of $B$ by $h$ on the
right and $g$ on the left, respectively. Since $gh = hg$, the group is Abelian and the
concepts of left- and right-invariance coincide. An invariant measure is given by
the density

(4.13) $\frac{1}{g}\, dg.$

To see this, note that

$\Lambda(B) = \int_B \frac{1}{g}\, dg = \int_{Bh} \frac{h}{g'} \frac{dg}{dg'}\, dg' = \int_{Bh} \frac{1}{g'}\, dg' = \Lambda(Bh),$

where the second equality follows by making the change of variables $g' = gh$. ||


Example 4.5 Location-scale group. As a last and somewhat more complicated
example, consider the group of transformations

$gx = (ax_1 + b, \ldots, ax_n + b), \quad 0 < a < \infty, \quad -\infty < b < \infty.$

If $g = (a, b)$ and $h = (c, d)$, we have

$hx = cx + d$

and

$ghx = a(cx + d) + b = acx + (ad + b).$

So, the composition rule is

(4.14) $(a, b) \circ (c, d) = (ac, ad + b).$

Since

$(c, d) \circ (a, b) = (ac, cb + d),$

it is seen that the group operation is not commutative, and we shall therefore have
to distinguish between left and right Haar measures. We shall now show that these
are given, respectively, by the densities

(4.15) $\frac{1}{a^2}\, da\, db$ and $\frac{1}{a}\, da\, db.$

To show left-invariance of the first of these, note that the transformation $g(h) = g \circ h$
takes

(4.16) $h = (c, d)$ into $(c' = ac, d' = ad + b).$

If $\Lambda$ has density $da\, db/a^2$, we have

(4.17) $\Lambda(gB) = \Lambda\{(ac, ad + b) : (c, d) \in B\}$

and, hence,

(4.18) $\Lambda(B) = \int_B \frac{1}{c^2}\, dc\, dd = \int_{gB} \frac{a^2}{(c')^2} \frac{\partial(c, d)}{\partial(c', d')}\, dc'\, dd',$

where

$\frac{\partial(c', d')}{\partial(c, d)} = \begin{vmatrix} a & 0 \\ 0 & a \end{vmatrix} = a^2$

is the Jacobian of the transformation (4.16). The right side of (4.18) therefore
reduces to

$\int_{gB} \frac{1}{(c')^2}\, dc'\, dd' = \Lambda(gB)$

and thus proves (4.9).

To prove the right-invariance of the density $da\, db/a$, consider the transformation
$h(g) = g \circ h$ taking

(4.19) $g = (a, b)$ into $(a' = ac, b' = ad + b).$

We then have

(4.20) $\Lambda(B) = \int_B \frac{1}{a}\, da\, db = \int_{Bh} \frac{c}{a'} \frac{\partial(a, b)}{\partial(a', b')}\, da'\, db'.$

The Jacobian of the transformation (4.19) is

$\frac{\partial(a', b')}{\partial(a, b)} = \begin{vmatrix} c & d \\ 0 & 1 \end{vmatrix} = c,$

which shows that the right side of (4.20) is equal to $\Lambda(Bh)$. ||
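
The composition rule (4.14) and the two invariance properties can be illustrated on rectangles $B = [a_1, a_2] \times [b_1, b_2]$, for which the measures under both densities in (4.15) are available in closed form. The sketch below is an illustration under that restriction; the function names and numerical values are hypothetical. The left translate $gB$ is again a rectangle, while the right translate $Bh$ has, for each $a' = ac$, a $b'$-section of length $b_2 - b_1$.

```python
import math


def compose(g, h):
    """Group operation (4.14): (a, b) o (c, d) = (ac, ad + b)."""
    (a, b), (c, d) = g, h
    return (a * c, a * d + b)


def left_measure_rect(a1, a2, b1, b2):
    """Measure of [a1, a2] x [b1, b2] under the density da db / a^2."""
    return (1 / a1 - 1 / a2) * (b2 - b1)


def right_measure_rect(a1, a2, b1, b2):
    """Measure of [a1, a2] x [b1, b2] under the density da db / a."""
    return math.log(a2 / a1) * (b2 - b1)


def left_translate_rect(g, rect):
    """Image gB of a rectangle B; (c, d) -> (ac, ad + b) maps it to a rectangle."""
    a, b = g
    c1, c2, d1, d2 = rect
    return (a * c1, a * c2, a * d1 + b, a * d2 + b)


def measures_of_right_translate(h, rect):
    """(da db / a^2 measure, da db / a measure) of Bh for a rectangle B.
    Each b'-section of Bh has length b2 - b1, so both measures reduce to
    one-dimensional integrals over a' in [c a1, c a2]."""
    c, _ = h
    a1, a2, b1, b2 = rect
    left = (1 / (c * a1) - 1 / (c * a2)) * (b2 - b1)
    right = math.log((c * a2) / (c * a1)) * (b2 - b1)
    return left, right


g, h = (2.0, 1.0), (0.5, -3.0)
B = (1.0, 4.0, -1.0, 2.0)
print(compose(g, h), compose(h, g))                      # the operation is not commutative
print(left_measure_rect(*B), left_measure_rect(*left_translate_rect(g, B)))
print(right_measure_rect(*B), measures_of_right_translate(h, B)[1])
```

In this example the density $da\,db/a^2$ is unchanged by left translation but not by right translation, and the reverse holds for $da\,db/a$.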


We introduced invariant measures over groups as a tool for defining measures
over the parameter space $\Omega$ that, in some sense, share these invariance properties.
For this purpose, consider a measure $\Lambda$ over a transitive group $G$ that leaves $\Omega$
invariant. Then, $\Lambda$ induces a measure $\Lambda'$ by the relation

(4.21) $\Lambda'(\omega) = \Lambda\{g \in G : g\theta_0 \in \omega\},$

where $\omega$ is any subset of $\Omega$, and $\theta_0$ any given point of $\Omega$. A disadvantage of this
definition is the fact that the resulting measure $\Lambda'$ will typically depend on $\theta_0$,
so that it is not uniquely defined by this construction. However, this difficulty
disappears when $\Lambda$ is right invariant.

Lemma 4.6 If $G$ is transitive over $\Omega$, and $\Lambda$ is a right-invariant measure over $G$,
then $\Lambda'$ defined by (4.21) is independent of $\theta_0$.

Proof. If $\theta_1$ is any other point of $\Omega$, there exists (by the assumption of transitivity)
an element $h$ of $G$ such that $\theta_1 = h\theta_0$. Let

$\Lambda''(\omega) = \Lambda\{g \in G : g\theta_1 \in \omega\}$

and let $B$ be the subset of $G$ given by

$B = \{g : g\theta_0 \in \omega\}.$

Then,

$\{g : gh \in B\} = \{gh^{-1} : g \in B\} = Bh^{-1}$

and

$\Lambda''(\omega) = \Lambda\{g : gh\theta_0 \in \omega\} = \Lambda\{gh^{-1} : g\theta_0 \in \omega\}$
$= \Lambda(Bh^{-1}) = \Lambda(B) = \Lambda'(\omega),$

where the next to last equation follows from the fact that $\Lambda$ is right invariant. □

Example 4.7 Continuation of Example 4.3. The group $G$ of Example 4.3 given
by (4.10) and (4.11) and (1.2) of Section 1 induces on $\Omega = \{\eta : -\infty < \eta < \infty\}$
the transformation

    $\bar g \eta = \eta + g$

and, as we saw in Example 4.3, Lebesgue measure $\Lambda$ is both right and left invariant
over $G$. For any point $\eta_0$ and any subset $\omega$ of $\Omega$, we find

    $\Lambda'(\omega) = \Lambda\{g \in G : \eta_0 + g \in \omega\} = \Lambda\{g \in G : g \in \omega - \eta_0\}$,

where $\omega - \eta_0$ denotes the set $\omega$ translated by an amount $\eta_0$. Since Lebesgue measure
of $\omega - \eta_0$ is the same as that of $\omega$, it follows that $\Lambda'$ is Lebesgue measure over $\Omega$
regardless of the choice of $\eta_0$.

Let us now determine the Bayes estimator for this prior measure when the
loss function is squared error. By (1.6), the Bayes estimator of $\eta$ is then (Problem
4.2)

(4.22)  $\delta(x) = \frac{\int u\, f(x_1 - u, \ldots, x_n - u)\,du}{\int f(x_1 - u, \ldots, x_n - u)\,du}$.

This is the Pitman estimator (3.1.28) of Chapter 3, which in Theorem 1.20 of that
chapter was seen to be the MRE estimator of $\eta$. ‖
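As a rough numerical illustration of (4.22), the following sketch approximates the two integrals by the trapezoidal rule for an iid sample. The function name `pitman_location`, the integration window (the sample range padded by 10 units), and the data are illustrative assumptions and not part of the text. For a standard normal $f$ the result should agree with the sample mean, and shifting the data by a constant should shift the estimate by the same constant.

```python
import math

def pitman_location(xs, log_f, pad=10.0, steps=4000):
    """Approximate the Pitman estimator (4.22) by the trapezoidal rule.

    xs    : observations x_1, ..., x_n
    log_f : log of the one-dimensional density; the joint density is
            assumed to be the product f(x_1 - u) ... f(x_n - u)
    pad   : how far beyond the sample range to integrate (an assumption;
            the integrand must be negligible outside this window)
    """
    lo, hi = min(xs) - pad, max(xs) + pad
    h = (hi - lo) / steps
    us = [lo + i * h for i in range(steps + 1)]
    logs = [sum(log_f(x - u) for x in xs) for u in us]
    top = max(logs)  # subtract the maximum before exponentiating
    num = den = 0.0
    for i, (u, lg) in enumerate(zip(us, logs)):
        w = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        g = w * math.exp(lg - top)
        num += u * g
        den += g
    return num / den

def log_normal(z):
    return -0.5 * z * z      # standard normal, up to a constant

def log_laplace(z):
    return -abs(z)           # double exponential, up to a constant

xs = [1.2, -0.4, 0.9, 2.5, 0.3]
est_normal = pitman_location(xs, log_normal)
est_laplace = pitman_location(xs, log_laplace)
est_laplace_shifted = pitman_location([x + 3.0 for x in xs], log_laplace)
print(est_normal, sum(xs) / len(xs))           # the two should agree closely
print(est_laplace, est_laplace_shifted - 3.0)  # equivariance under shifts
```

Normalizing constants of $f$ cancel in the ratio, which is why the log densities above omit them.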


Example 4.8 Continuation of Example 4.4. The scale group $G$ given in Example 4.4
and by (3.2) of Section 3.3 induces on $\Omega = \{\tau : 0 < \tau < \infty\}$ the
transformations

(4.23)  $\bar g \circ \tau = g \times \tau$

and, as we saw in Example 4.4, the measure $\Lambda$ with density $\frac{1}{g}\,dg$ is both left and
right invariant over $G$. For any point $\tau_0$ and any subset $\omega$ of $\Omega$, we find

    $\Lambda'(\omega) = \Lambda\{g \in G : g\tau_0 \in \omega\} = \Lambda\{g \in G : g \in \omega/\tau_0\}$,

where $\omega/\tau_0$ denotes the set of values in $\omega$ each divided by $\tau_0$. The change of
variables $g' = \tau_0 g$ shows that

    $\int_{\omega/\tau_0} \frac{1}{g}\,dg = \int_{\omega} \frac{1}{g'}\,dg'$

and, hence, that

    $\Lambda'(\omega) = \int_{\omega} \frac{d\tau}{\tau}$.

Let us now determine the Bayes estimator of $\tau$ for this prior distribution when the
loss function is

    $L(\tau, d) = \frac{(d - \tau)^2}{\tau^2}$.

This turns out to be the Pitman estimator (3.19) of Section 3.3 with $r = 1$,

(4.24)  $\frac{\int_0^\infty v^{n} f(vx_1, \ldots, vx_n)\,dv}{\int_0^\infty v^{n+1} f(vx_1, \ldots, vx_n)\,dv}$,

which is also MRE (Problems 3.3.17 and 4.3).


Example 4.9 Continuation of Example 4.5. The location-scale family of distributions
(3.23) of Section 3.3 remains invariant under the transformations

    $gx : x_i' = a x_i + b$,  $0 < a < \infty$,  $-\infty < b < \infty$,

which induce in the parameter space

    $\Omega = \{(\eta, \tau) : -\infty < \eta < \infty,\ 0 < \tau\}$

the transformations

(4.25)  $\eta' = a\eta + b$,  $\tau' = a\tau$.

It was seen in Example 4.5 that the left and right Haar measures $\Lambda_1$ and $\Lambda_2$ over
the group $G = \{g = (a, b) : 0 < a < \infty,\ -\infty < b < \infty\}$ with group operation (4.14)
are given by the densities

(4.26)  $\frac{1}{a^2}\,da\,db$  and  $\frac{1}{a}\,da\,db$,

respectively.

Let us now determine the corresponding measures over $\Omega$ induced by (4.19). If
we describe the elements $g$ in this group by $(a, b)$, then for any measure $\Lambda$ over
$G$ and any parameter point $(\eta_0, \tau_0)$, the induced measure $\Lambda'$ over $\Omega$ is given by

(4.27)  $\Lambda'(\omega) = \Lambda\{(a, b) : (a\eta_0 + b,\ a\tau_0) \in \omega\}$.

Since a measure over the Borel sets is determined by its values over open intervals,
it is enough to calculate $\Lambda'(\omega)$ for

(4.28)  $\omega : \eta_1 < \eta < \eta_2,\ \tau_1 < \tau < \tau_2$.

If, furthermore, we assume that $\Lambda$ has a density $\lambda$,

    $\Lambda'(\omega) = \Lambda\{(a, b) : \eta_1 - a\eta_0 < b < \eta_2 - a\eta_0,\ \tau_1/\tau_0 < a < \tau_2/\tau_0\}$
    $= \int_{\tau_1/\tau_0}^{\tau_2/\tau_0} \left[ \int_{\eta_1 - a\eta_0}^{\eta_2 - a\eta_0} \lambda(a, b)\,db \right] da$.

In this integral, let us now change variables from $(a, b)$ to

    $a' = a\tau_0$,  $b' = b + a\eta_0$.

The Jacobian of the transformation is

    $\frac{\partial(a', b')}{\partial(a, b)} = \begin{vmatrix} \tau_0 & \eta_0 \\ 0 & 1 \end{vmatrix} = \tau_0$,

and we, therefore, find

    $\Lambda'(\omega) = \int_{\tau_1}^{\tau_2}\!\int_{\eta_1}^{\eta_2} \lambda'(a', b')\,da'\,db' = \int_{\tau_1}^{\tau_2}\!\int_{\eta_1}^{\eta_2} \lambda(a, b)\,\frac{1}{\tau_0}\,da'\,db'$,

with $a = a'/\tau_0$ and $b = b' - a\eta_0 = b' - a'\eta_0/\tau_0$. We can therefore take

    $\lambda'(a', b') = \frac{1}{\tau_0}\,\lambda\!\left(\frac{a'}{\tau_0},\ b' - \frac{a'\eta_0}{\tau_0}\right)$.

Now consider the following two cases:

(i) For $\lambda(a, b) = \frac{1}{a}$,

(4.29)  $\lambda'(a', b') = \frac{1}{\tau_0}\,\frac{\tau_0}{a'} = \frac{1}{a'}$.

(ii) For $\lambda(a, b) = \frac{1}{a^2}$,

(4.30)  $\lambda'(a', b') = \frac{1}{\tau_0}\,\frac{\tau_0^2}{a'^2} = \frac{\tau_0}{a'^2} \propto \frac{1}{a'^2}$.

The Bayes estimators of $\eta$ and $\tau$ corresponding to (4.29) are (Problem 4.4)

(4.31)  $\hat\eta(x) = \frac{\int_{-\infty}^{\infty}\int_0^{\infty} u\,\frac{1}{v^{n+3}}\, f\!\left(\frac{x_1 - u}{v}, \ldots, \frac{x_n - u}{v}\right) dv\,du}{\int_{-\infty}^{\infty}\int_0^{\infty} \frac{1}{v^{n+3}}\, f\!\left(\frac{x_1 - u}{v}, \ldots, \frac{x_n - u}{v}\right) dv\,du}$

and

(4.32)  $\hat\tau(x) = \frac{\int_{-\infty}^{\infty}\int_0^{\infty} \frac{1}{v^{n+2}}\, f\!\left(\frac{x_1 - u}{v}, \ldots, \frac{x_n - u}{v}\right) dv\,du}{\int_{-\infty}^{\infty}\int_0^{\infty} \frac{1}{v^{n+3}}\, f\!\left(\frac{x_1 - u}{v}, \ldots, \frac{x_n - u}{v}\right) dv\,du}$.

These turn out to be the MRE estimators of $\eta$ and $\tau$ under the loss functions
$(d - \eta)^2/\tau^2$ and $(d - \tau)^2/\tau^2$, respectively. ‖


The treatment of these three examples extends to a number of other important
cases (see, for example, Problems 4.6 and 4.7) and suggests that the Bayes estimator
with respect to the measure induced by right Haar measure over $G$ is equivariant
and is, in fact, MRE. For conditions under which these conclusions are valid, see
Berger 1985 (Section 6.6), Robert 1994a (Section 7.4), or Schervish 1995 (Section
6.2); a special case is treated in Section 5.4. It is also worth noting that if a Haar
measure $\Lambda$ over a group $G$ is finite, that is, $\Lambda(G) < \infty$, then left and right Haar
measures coincide.

At the beginning of the section, we defined invariance of a prior distribution by
(4.1) and (4.2), and the same equations define invariance of a measure over $\Omega$ even
if it is improper. We shall now consider whether the measure $\Lambda'$ induced over $\Omega$
by left- or right-invariant Haar measure is invariant in this sense.

Example 4.10 Invariance of induced measures. We look at the location-scale
groups and consider invariance of the induced measures.

(i) Location Group. We saw in Example 4.7 that left and right Haar measures $\Lambda$
coincide in this case and that $\Lambda'$ is Lebesgue measure, which clearly satisfies
(4.2).

(ii) Scale Group. Again, left and right Haar measure $\Lambda$ coincide, and by Example
4.8, $\Lambda'(\omega) = \int_\omega \frac{1}{\tau}\,d\tau$. Since $\bar g^{-1}(\omega) = \omega/g$, as in Example 4.4,

    $\Lambda'[\bar g^{-1}(\omega)] = \int_{\omega/g} \frac{1}{\tau}\,d\tau = \int_{\omega} \frac{1}{\tau}\,d\tau = \Lambda'(\omega)$,

so that $\Lambda'$ is invariant.

(iii) Location-Scale Group. Here, the densities induced by the left- and right-invariant
Haar measures are given by (4.30) and (4.29), respectively. Calculations
similar to those of Example 4.9 show that the former is invariant but
the latter is not. ‖

The general situation is described in the following result. 

Theorem 4.11 Under the assumptions of Theorem 4.1, the measure $\Lambda'$ over $\Omega$
induced by a measure $\Lambda$ over $G$ is invariant provided $\Lambda$ is left invariant.

Proof. For any $\omega$ and $\theta_0 \in \Omega$, let $B = \{h \in G : h\theta_0 \in \omega\}$, so that $\Lambda'(\omega) = \Lambda(B)$.
Then, $gB = \{gh : h\theta_0 \in \omega\}$ and

    $\Lambda'(g\omega) = \Lambda(h : h\theta_0 \in g\omega) = \Lambda(h : g^{-1}h\theta_0 \in \omega)$
    $= \Lambda(gh : h\theta_0 \in \omega) = \Lambda(gB)$.

Thus, $\Lambda'(g\omega) = \Lambda'(\omega)$ if and only if $\Lambda(gB) = \Lambda(B)$, and it follows that $\Lambda'$ is
invariant if and only if $\Lambda$ is left invariant. □

Note that this result does not contradict the remark made after Example 4.9 to the 
effect that the Bayes estimator under the prior measure induced by right-invariant 
Haar measure is equivariant. A Bayes estimator can be equivariant under a prior
measure $\Lambda$ even if $\Lambda$ is not invariant (see Problem 4.8).

When there are no groups leaving the given family of distributions invariant, no 
Haar measure is available to serve as a noninformative prior. In such situations, 
transformations that utilize some (perhaps arbitrary) structure of the parameter 
space may sometimes be used to deduce a form for a “noninformative” prior 
(Villegas 1990). A discussion of these approaches is given by Berger 1985 (Section 
3.3); see also Bernardo and Smith 1994 (Section 5.6.2). 

5 Hierarchical Bayes 

In a hierarchical Bayes model, rather than specifying the prior distribution as a
single function, we specify it in a hierarchy. Thus, we place another level on the
model (3.1), and write

        $X|\theta \sim f(x|\theta)$,
(5.1)   $\Theta|\gamma \sim \pi(\theta|\gamma)$,
        $\Gamma \sim \psi(\gamma)$,

where we assume that $\psi(\cdot)$ is known and not dependent on any other unknown
hyperparameters (as parameters of a prior are sometimes called). Note that we can
continue this hierarchical modeling and add more stages to the model, but this is
not often done in practice. The class of models (5.1) appears to be more general
than the class (3.1) since in (3.1), $\gamma$ has a fixed value, but in (5.1), it is permitted to
have an arbitrary probability distribution. However, this appearance is deceptive.
Since $\pi(\theta|\gamma)$ in (3.1) can be any fixed distribution, we can, in particular, take for it
$\pi(\theta) = \int \pi(\theta|\gamma)\psi(\gamma)\,d\gamma$, which reduces the hierarchical model (5.1) to the
single-prior model (3.1). However, there is a conceptual and practical advantage to the
hierarchical model, in that it allows us to model relatively complicated situations
using a series of simpler steps; that is, both $\pi(\theta|\gamma)$ and $\psi(\gamma)$ may be of a simple
form (even conjugate), but $\pi(\theta)$ may be more complex. Moreover, there is often a
computational advantage to hierarchical modeling. We will illustrate both of these
points in this section.

It is also interesting to note that this process can be reversed. Starting from the
single-prior model (3.1), we can look for a decomposition of the prior $\pi(\theta)$ of the
form $\pi(\theta) = \int \pi(\theta|\gamma)\psi(\gamma)\,d\gamma$ and thus create the hierarchy (5.1). Such modeling,
known as hidden Markov models, hidden mixtures, or deconvolution, has proved
very useful (Churchill 1989, Robert 1994a (Section 9.3), Robert and Casella 1998).

Given a loss function $L(\theta, d)$, we would then determine the estimator that minimizes

(5.2)   $\int L(\theta, d(x))\,\pi(\theta|x)\,d\theta$

where $\pi(\theta|x) = \int f(x|\theta)\pi(\theta|\gamma)\psi(\gamma)\,d\gamma \,\big/ \int\!\!\int f(x|\theta)\pi(\theta|\gamma)\psi(\gamma)\,d\theta\,d\gamma$. Note also
that

(5.3)   $\pi(\theta|x) = \int \pi(\theta|x, \gamma)\,\pi(\gamma|x)\,d\gamma$

where $\pi(\gamma|x)$ is the posterior distribution of $\Gamma$, unconditional on $\theta$. We may then
write (5.2) as

(5.4)   $\int L(\theta, d(x))\,\pi(\theta|x)\,d\theta = \int \left[ \int L(\theta, d(x))\,\pi(\theta|x, \gamma)\,d\theta \right] \pi(\gamma|x)\,d\gamma$,

which shows that the hierarchical Bayes estimator can be thought of as a mixture
of single-prior Bayes estimators. (See Problems 5.1 and 5.2.)

Hierarchical models allow easier modeling of prior distributions with “flatter”
tails, which can lead to Bayes estimators with more desirable frequentist properties.
This latter end is often achieved by taking $\psi(\cdot)$ to be improper (see, for example,
Berger and Robert 1990, or Berger and Strawderman 1996).


Example 5.1 Conjugate normal hierarchy. Starting with the normal distribution
and modeling each stage with a conjugate prior yields the hierarchy

        $X_i|\theta \sim N(\theta, \sigma^2)$,  $\sigma^2$ known,  $i = 1, \ldots, n$,
(5.5)   $\Theta|\tau \sim N(0, \tau^2)$,
        $\frac{1}{\tau^2} \sim \mathrm{Gamma}(a, b)$,  $a, b$ known.

The hierarchical Bayes estimator of $\theta$ under squared error loss is

        $E(\Theta|x) = \int\!\!\int \theta\,\pi(\theta|x, \tau^2)\,d\theta\,\pi(\tau^2|x)\,d\tau^2$
(5.6)   $= \int \frac{n\tau^2}{n\tau^2 + \sigma^2}\,\bar x\,\pi(\tau^2|x)\,d\tau^2$
        $= E[E(\Theta|x, \tau^2)]$,

which is the expectation of the single-prior Bayes estimator using the density
$\pi(\tau^2|x)$. (See Problem 5.3 for the form of the posterior distributions.) Although
there is no explicit form for $E(\Theta|x)$, calculation is not particularly difficult.


It is interesting to note that even though at each stage of the model (5.5) a
conjugate prior was used, the resulting Bayes estimator is not from a conjugate
prior (the prior $\pi(\theta|a, b) = \int \pi(\theta|\tau)\psi(\tau|a, b)\,d\tau$ is not conjugate) and is not
expressible in a simple form. Such an occurrence is somewhat commonplace in
hierarchical Bayes analysis and leads to more reliance on numerical methods.

Example 5.2 Conjugate normal hierarchy, continued. As a special case of the
model (5.5), consider the model

        $X_i|\theta \sim N(\theta, \sigma^2)$,  $i = 1, \ldots, p$, independent,
(5.7)   $\Theta|\tau^2 \sim N(0, \tau^2)$,
        $\frac{1}{\tau^2} \sim \mathrm{Gamma}\!\left(\frac{\nu}{2}, \frac{2}{\nu}\right)$.

This leads to a Student's $t$-prior distribution on $\Theta$, and a posterior mean

(5.8)   $E[\Theta|\bar x] = \frac{\int_{-\infty}^{\infty} \theta\,(1 + \theta^2/\nu)^{-(\nu+1)/2}\, e^{-\frac{p}{2\sigma^2}(\theta - \bar x)^2}\,d\theta}{\int_{-\infty}^{\infty} (1 + \theta^2/\nu)^{-(\nu+1)/2}\, e^{-\frac{p}{2\sigma^2}(\theta - \bar x)^2}\,d\theta}$,

which is not expressible in a simple form. Numerical evaluation of (5.8) is simple,
so calculation of this hierarchical Bayes estimator in practice poses no problem.
However, evaluation of the mean squared error or Bayes risk of (5.8) presents a
more substantial task. ‖
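To make the remark about numerical evaluation concrete, here is a minimal sketch that evaluates the ratio of integrals in (5.8) by the trapezoidal rule. The function name, the integration window of $\pm 12$ standard deviations of $\bar x$ around $\bar x$, and the parameter values are assumptions chosen for illustration. As a sanity check, for very large $\nu$ the $t$ prior is close to $N(0, 1)$, so the result should approach the conjugate posterior mean $\bar x\, p/(p + \sigma^2)$.

```python
import math

def t_prior_posterior_mean(xbar, p, sigma2, nu, half_width=12.0, steps=20000):
    """Trapezoidal approximation to the ratio of integrals in (5.8)."""
    sd = math.sqrt(sigma2 / p)  # standard deviation of xbar given theta
    lo, hi = xbar - half_width * sd, xbar + half_width * sd
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        log_prior = -0.5 * (nu + 1.0) * math.log1p(theta * theta / nu)
        log_lik = -p * (theta - xbar) ** 2 / (2.0 * sigma2)
        g = w * math.exp(log_prior + log_lik)
        num += theta * g
        den += g
    return num / den

est = t_prior_posterior_mean(xbar=2.0, p=5, sigma2=1.0, nu=3.0)
est_neg = t_prior_posterior_mean(xbar=-2.0, p=5, sigma2=1.0, nu=3.0)
est_big_nu = t_prior_posterior_mean(xbar=1.0, p=5, sigma2=1.0, nu=1e6)
print(est, est_neg)        # shrinks xbar toward 0, symmetric in sign
print(est_big_nu, 5 / 6)   # close to the normal-prior answer
```

The window is centered at $\bar x$ because the normal factor in the integrand is negligible more than a few multiples of $\sigma/\sqrt{p}$ away from it, while the $t$ factor is bounded by 1.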


In the preceding example, the hierarchical Bayes estimator was expressible as a
ratio of integrals which easily yielded to either direct calculation or simple
approximation. There are other cases, however, in which a straightforward hierarchical
model can lead to very difficult problems in evaluation of a Bayes estimator.

Example 5.3 Beta-binomial hierarchy. A generalization of the standard
beta-binomial hierarchy is

        $X|p \sim \mathrm{binomial}(p, n)$,
(5.9)   $p|a, b \sim \mathrm{beta}(a, b)$,
        $(a, b) \sim \psi(a, b)$,

leading to the posterior mean

(5.10)  $E(p|x) = \frac{\int_0^1 p^{x+1}(1 - p)^{n-x}\,\pi(p)\,dp}{\int_0^1 p^{x}(1 - p)^{n-x}\,\pi(p)\,dp}$,

where

(5.11)  $\pi(p) = \int\!\!\int \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, p^{a-1}(1 - p)^{b-1}\,\psi(a, b)\,da\,db$.

For almost any choice of $\psi(a, b)$, calculation of (5.11), and hence (5.10), is quite
difficult. Indeed, there could be difficulty with numerical integration, simulation,
and approximation. Moreover, if $\psi(a, b)$ is chosen to be improper, as is typical
in such hierarchies, the propriety of $\pi(p|x)$ is not easy to verify (and often does
not obtain). George et al. (1993) provide algorithms for calculating expressions
such as (5.10), and Hobert (1994) establishes conditions for the propriety of some
resulting posterior distributions. ‖

To overcome the difficulties in computing hierarchical Bayes estimators, we
need to establish either easy-to-use formulas or good approximations, in order to
further investigate their risk optimality. The approximation issue will be addressed
in the next section. In the remainder of the present section, we consider the
evaluation of (3.5) using theory based on Markov chain limiting behavior (see Note 9.4
for a brief discussion). Although this theory does not result in a simple expression
for the Bayes estimators in general, it usually allows us to write expressions such
as (5.6) as a limit of simple estimators. (Technically, these computations are not
approximations, as they are exact in the limit. However, since they involve only
a finite number of computations, we think of them as approximations, but realize
that any order of precision can be achieved.) The resulting techniques, collectively
known as Markov chain Monte Carlo (MCMC) techniques (see Tanner 1996, Gilks
et al. 1996, or Robert and Casella 1998) can greatly facilitate calculation of a
hierarchical Bayes estimator. One of the most popular of these methods is known as the
Gibbs sampler [brought to statistical prominence by Gelfand and Smith (1990)],
which we now illustrate.

Starting with the hierarchy (5.1), suppose we are interested in calculating the
posterior distribution $\pi(\theta|x)$ (or $E(\Theta|x)$, or some other feature of the posterior
distribution). From (5.1) we calculate the full conditionals

(5.12)  $\Theta|x, \gamma \sim \pi(\theta|x, \gamma)$,
        $\Gamma|x, \theta \sim \pi(\gamma|x, \theta)$,

which are the posterior distributions of each parameter conditional on all others.
If, for $i = 1, 2, \ldots, M$, random variables are generated according to

(5.13)  $\Theta_i|x, \gamma_{i-1} \sim \pi(\theta|x, \gamma_{i-1})$,
        $\Gamma_i|x, \theta_i \sim \pi(\gamma|x, \theta_i)$,

this defines a Markov chain $(\Theta_i, \Gamma_i)$. It follows from the theory of such chains (see
Note 9.4) that there exist distributions $\pi(\theta|x)$ and $\pi(\gamma|x)$ such that

(5.14)  $\Theta_i \xrightarrow{\mathcal{L}} \Theta \sim \pi(\theta|x)$,
        $\Gamma_i \xrightarrow{\mathcal{L}} \Gamma \sim \pi(\gamma|x)$

as $i \to \infty$, and

(5.15)  $\frac{1}{M}\sum_{i=1}^{M} h(\Theta_i) \to E(h(\Theta)|x) = \int h(\theta)\,\pi(\theta|x)\,d\theta$

as $M \to \infty$. (A full development of this theory is given in Meyn and Tweedie
(1993). See also Resnick 1992 for an introduction to Markov chains and Robert
1994a for more applications to Bayesian calculation.)

It follows from (5.15) that for $\Theta_i$ generated according to (5.13), we have

(5.16)  $\frac{1}{M}\sum_{i=1}^{M} \Theta_i \to E(\Theta|x)$,

the hierarchical Bayes estimator. (Problems 5.8 - 5.11 develop some of the more
practical aspects of this theory.)

Example 5.4 Poisson hierarchy with Gibbs sampling. As an example of a Poisson
hierarchy (see also Example 6.6), consider

        $X|\lambda \sim \mathrm{Poisson}(\lambda)$,
(5.17)  $\Lambda|b \sim \mathrm{Gamma}(a, b)$,  $a$ known,
        $\frac{1}{b} \sim \mathrm{Gamma}(k, \tau)$,

leading to the full conditionals

(5.18)  $\Lambda|x, b \sim \mathrm{Gamma}\!\left(a + x,\ \frac{b}{1 + b}\right)$,
        $\frac{1}{b}\,\Big|\,x, \lambda \sim \mathrm{Gamma}\!\left(a + k,\ \frac{\tau}{1 + \lambda\tau}\right)$.

Recall that in this hierarchy, $\pi(\lambda|x)$ is not expressible in a simple form. However,
if we simulate from (5.18), we obtain a sequence $\{\Lambda_i\}$ satisfying

(5.19)  $\frac{1}{M}\sum_{i=1}^{M} h(\Lambda_i) \to \int h(\lambda)\,\pi(\lambda|x)\,d\lambda = E[h(\Lambda)|x]$.

Alternatively, we could use a $\{b_i\}$ sequence and calculate

(5.20)  $\frac{1}{M}\sum_{i=1}^{M} \pi(\lambda|x, b_i) \to \int \pi(\lambda|x, b)\,\pi(b|x)\,db = \pi(\lambda|x)$.


The Gibbs sampler actually yields two methods of calculating the same quantity.
For example, from the hierarchy (5.1), using the full conditionals of (5.12) and the
iterations in (5.13), we could estimate $E(h(\Theta)|x)$ by

        (i)   $\frac{1}{M}\sum_{i=1}^{M} h(\Theta_i) \to \int h(\theta)\,\pi(\theta|x)\,d\theta = E(h(\Theta)|x)$
(5.21)
or by
        (ii)  $\frac{1}{M}\sum_{i=1}^{M} E[h(\Theta)|x, \Gamma_i] \to \int E(h(\Theta)|x, \gamma)\,\pi(\gamma|x)\,d\gamma = E(h(\Theta)|x)$.

Implementation of the Gibbs sampler is most effective when the full conditionals
are easy to work with, and in such cases, it is often possible to calculate
$E(h(\Theta)|x, \Gamma_i)$ in a simple form, so (5.21)(ii) is a viable option. To see that it is
superior to (5.21)(i), write

    $E(h(\Theta)|x) = E[E(h(\Theta)|x, \Gamma)]$

and apply the Rao-Blackwell theorem (see Problem 5.12).

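Example 5.5 Gibbs point estimation. To calculate the hierarchical Bayes estimator
of $\lambda$ in Example 5.4, we use

    $\frac{1}{M}\sum_{i=1}^{M} E(\Lambda|x, b_i) = \frac{1}{M}\sum_{i=1}^{M} \frac{b_i}{1 + b_i}\,(a + x)$

rather than $(1/M)\sum_{i=1}^{M} \Lambda_i$. Analogously, the posterior density $\pi(\lambda|x)$ can be
calculated by

    $\hat\pi(\lambda|x) = \frac{1}{M}\sum_{i=1}^{M} \pi(\lambda|x, b_i) = \frac{\lambda^{a+x-1}}{M\,\Gamma(a + x)} \sum_{i=1}^{M} \left(\frac{1 + b_i}{b_i}\right)^{a+x} e^{-\lambda(1 + b_i)/b_i}$.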


The actual implementation of the Gibbs sampler relies on Monte Carlo techniques
to simulate random variables from the distributions in (5.13). Very efficient
algorithms for such simulations are available, and Robert (1994a, Appendix B)
catalogs a number of them. There are also full developments in Devroye (1985)
and Ripley (1987). (See Problems 5.14 and 5.15.)

For many problems, the simulation step is straightforward to implement on a 
computer so we can take M as large as we like. This makes it possible for the 
approximations to have any desired precision, with the only limiting factor being 
computer time. (In this sense we are doing exact calculations.) Many applications 
of these techniques are given in Tanner (1996). 
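The Poisson hierarchy of Example 5.4 gives a compact illustration of these ideas. The sketch below alternates draws from the two full conditionals in (5.18) and returns both the plain average of the $\Lambda_i$ and the Rao-Blackwellized average of Example 5.5. The function name, the starting value $b = 1$, the burn-in length, the seed, and the data values are assumptions made for the illustration; the text does not prescribe them. Python's `random.gammavariate(alpha, beta)` takes `beta` as a scale parameter, which matches the Gamma$(a, b)$ convention implied by (5.18).

```python
import random

def gibbs_poisson(x, a, k, tau, M=20000, burn_in=1000, seed=0):
    """Gibbs sampler for the hierarchy (5.17) using the conditionals (5.18).

    Returns (plain, rao_blackwell):
      plain         = (1/M) * sum of Lambda_i
      rao_blackwell = (1/M) * sum of E(Lambda | x, b_i) = (a + x) b_i / (1 + b_i)
    """
    rng = random.Random(seed)
    b = 1.0  # arbitrary starting value (an assumption)
    lam_sum = rb_sum = 0.0
    for i in range(burn_in + M):
        # Lambda | x, b ~ Gamma(a + x, scale b / (1 + b))
        lam = rng.gammavariate(a + x, b / (1.0 + b))
        # 1/b | x, lambda ~ Gamma(a + k, scale tau / (1 + lambda * tau))
        inv_b = rng.gammavariate(a + k, tau / (1.0 + lam * tau))
        b = 1.0 / inv_b
        if i >= burn_in:  # discard early iterations before averaging
            lam_sum += lam
            rb_sum += (a + x) * b / (1.0 + b)
    return lam_sum / M, rb_sum / M

plain, rao_blackwell = gibbs_poisson(x=5, a=2.0, k=3.0, tau=1.0)
print(plain, rao_blackwell)  # two estimates of the same posterior mean
```

Both averages estimate $E(\Lambda|x)$; the second replaces each draw by its conditional expectation and so typically fluctuates less from run to run.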

As a last example, consider the calculation of the hierarchical Bayes estimator 
of Example 5.2. 


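Example 5.6 Normal hierarchy. From (5.5), we have the set of full conditionals

(5.22)  $\theta|\bar x, \tau^2 \sim N\!\left(\frac{n\tau^2}{n\tau^2 + \sigma^2}\,\bar x,\ \frac{\sigma^2\tau^2}{n\tau^2 + \sigma^2}\right)$,
        $\frac{1}{\tau^2}\,\Big|\,\bar x, \theta \sim \mathrm{Gamma}\!\left(a + \frac{1}{2},\ \left[\frac{\theta^2}{2} + \frac{1}{b}\right]^{-1}\right)$.

Note that, conditional on $\theta$, $\tau^2$ is independent of $x$. Both of these conditional
distributions are easy to simulate from, and we thus use the Gibbs sampler to
generate a chain $(\Theta_i, \tau_i^2)$, $i = 1, \ldots, M$, from (5.22). This yields the approximation

(5.23)  $\hat E(\Theta|x) = \frac{1}{M}\sum_{i=1}^{M} E(\Theta|\bar x, \tau_i^2) = \frac{1}{M}\sum_{i=1}^{M} \frac{n\tau_i^2}{n\tau_i^2 + \sigma^2}\,\bar x \to E(\Theta|x)$

as $M \to \infty$.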


As mentioned before, one of the purposes of specifying a model in a hierarchy 
is to make it possible to model more complicated phenomena in a sequence of 
less complicated steps. In addition, the ordering in the hierarchy allows us both to 
order the importance of the parameters and to incorporate some of our uncertainty 
about the prior specification. 

To be precise, in the model

        $X|\theta \sim f(x|\theta)$,
(5.24)  $\Theta|\lambda \sim \pi(\theta|\lambda)$,
        $\Lambda \sim \psi(\lambda)$,

we tend to be more exacting in our specification of $\pi(\theta|\lambda)$, and less so in our
specification of $\psi(\lambda)$. Indeed, in many cases, $\psi(\lambda)$ is taken to be “flat” or
“noninformative” (for example, $\psi(\lambda)$ = Lebesgue measure). In practice, this leads to
heavier-tailed prior distributions $\pi(\theta)$, with the resulting Bayes estimators being
more robust (Berger and Robert 1990, Fourdrinier et al. 1996; see also Example
5.6.7).

One way of studying the effect that the stages of the hierarchy (5.24) have 
on each other is to examine, for each parameter, the information contained in its 
posterior distribution relative to its prior distribution. In effect, this measures how 
much the data can tell us about the parameter, with respect to the prior distribution. 

To measure this information, we can use Kullback-Leibler information (recall
Example 1.7.7), which also is known by the longer, and more appropriate name,
Kullback-Leibler information for discrimination between two densities. For densities
$f$ and $g$, it is defined by

(5.25)  $K[f, g] = \int \log\!\left[\frac{f(t)}{g(t)}\right] f(t)\,dt$.

The interpretation is that as $K[f, g]$ gets larger, it becomes easier to discriminate
between the densities $f$ and $g$; that is, there is more information for discrimination.
From the model (5.24), we can assess the information between the data and the
parameter by calculating $K[\pi(\theta|x), \pi(\theta)]$, where

(5.26)  $\pi(\theta) = \int \pi(\theta|\lambda)\,\psi(\lambda)\,d\lambda$,
        $\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta} = \frac{f(x|\theta)\,\pi(\theta)}{m(x)}$.

By comparison, the information between the data and the hyperparameter is
measured by $K[\pi(\lambda|x), \psi(\lambda)]$, where

(5.27)  $\pi(\lambda|x) = \frac{\int f(x|\theta)\,\pi(\theta|\lambda)\,\psi(\lambda)\,d\theta}{m(x)} = \frac{m(x|\lambda)\,\psi(\lambda)}{m(x)}$.

An important result about the two measures of information for (5.26) and (5.27) 
is contained in the following theorem. 

Theorem 5.7 For the model (5.24),

(5.28)  $K[\pi(\lambda|x), \psi(\lambda)] < K[\pi(\theta|x), \pi(\theta)]$.

From (5.28), we see that the distribution of the data has less effect on hyperpriors 
than priors, or, turning things around, the posterior distribution of a hyperparameter 
is less affected by changes in the prior than the posterior distribution of a parameter. 
This provides justification of the belief that parameters that are deeper in the 
hierarchy have less effect on inference. 

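Proof of Theorem 5.7. By definition,

(5.29)  $K[\pi(\lambda|x), \psi(\lambda)] = \int \pi(\lambda|x)\,\log\!\left(\frac{\pi(\lambda|x)}{\psi(\lambda)}\right) d\lambda$
        $= \int \frac{\pi(\lambda|x)}{\psi(\lambda)}\,\log\!\left(\frac{\pi(\lambda|x)}{\psi(\lambda)}\right) \psi(\lambda)\,d\lambda$.

Now, note that

(5.30)  $\frac{\pi(\lambda|x)}{\psi(\lambda)} = \frac{\int_\Omega f(x|\theta)\,\pi(\theta|\lambda)\,d\theta}{m(x)} = \int_\Omega \left(\frac{f(x|\theta)}{m(x)}\right) \pi(\theta|\lambda)\,d\theta$,

or, more succinctly, $\pi(\lambda|x)/\psi(\lambda) = E[f(x|\theta)/m(x)]$, where the expectation is
taken with respect to $\pi(\theta|\lambda)$. We now apply Jensen's inequality to (5.29), using
the fact that the function $x \log x$ is convex if $x > 0$, which leads to

(5.31)  $\frac{\pi(\lambda|x)}{\psi(\lambda)}\,\log\!\left(\frac{\pi(\lambda|x)}{\psi(\lambda)}\right) < E\!\left[\left(\frac{f(x|\theta)}{m(x)}\right) \log\!\left(\frac{f(x|\theta)}{m(x)}\right)\right]$
        $= \int \left(\frac{f(x|\theta)}{m(x)}\right) \log\!\left(\frac{f(x|\theta)}{m(x)}\right) \pi(\theta|\lambda)\,d\theta$.

Substituting back into (5.29), we have

(5.32)  $K[\pi(\lambda|x), \psi(\lambda)] < \int\!\!\int \left(\frac{f(x|\theta)}{m(x)}\right) \log\!\left(\frac{f(x|\theta)}{m(x)}\right) \pi(\theta|\lambda)\,\psi(\lambda)\,d\theta\,d\lambda$.

We now (of course) interchange the order of integration and notice that

(5.33)  $\int \frac{f(x|\theta)}{m(x)}\,\pi(\theta|\lambda)\,\psi(\lambda)\,d\lambda = \pi(\theta|x)$.

Substitution into (5.32), together with the fact that $f(x|\theta)/m(x) = \pi(\theta|x)/\pi(\theta)$,
yields (5.28). □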

Thus, hierarchical modeling allows us to be less concerned about the exact form
of $\psi(\lambda)$. This frees the modeler to choose a $\psi(\lambda)$ to yield other good properties
without unduly compromising the Bayesian interpretations of the model. For example,
as we will see in Chapter 5, $\psi(\lambda)$ can be chosen to yield hierarchical Bayes
estimators with reasonable frequentist performance.
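The inequality of Theorem 5.7 is easy to check numerically in a small discrete analogue of the model (5.24), with sums in place of integrals. In the sketch below, the hyperparameter takes two values and the parameter three; all of the probabilities, and the likelihood values for the single observed $x$, are made-up numbers chosen only for illustration.

```python
import math

def kl(p, q):
    """Discrete version of (5.25): sum of p * log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

psi = [0.5, 0.5]                        # hyperprior on lambda in {0, 1}
prior_given_lam = [[0.7, 0.2, 0.1],     # pi(theta | lambda = 0), theta in {0, 1, 2}
                   [0.1, 0.3, 0.6]]     # pi(theta | lambda = 1)
lik = [0.9, 0.5, 0.2]                   # f(x | theta) at the observed x

# Marginal prior and posterior of theta, as in (5.26)
prior_theta = [sum(psi[l] * prior_given_lam[l][t] for l in range(2))
               for t in range(3)]
m_x = sum(lik[t] * prior_theta[t] for t in range(3))
post_theta = [lik[t] * prior_theta[t] / m_x for t in range(3)]

# Posterior of lambda, as in (5.27)
m_x_given_lam = [sum(lik[t] * prior_given_lam[l][t] for t in range(3))
                 for l in range(2)]
post_lam = [m_x_given_lam[l] * psi[l] / m_x for l in range(2)]

k_hyper = kl(post_lam, psi)             # K[pi(lambda | x), psi(lambda)]
k_param = kl(post_theta, prior_theta)   # K[pi(theta | x), pi(theta)]
print(k_hyper, k_param)                 # the first should be the smaller
```

In this example the data move the distribution of the parameter noticeably more than that of the hyperparameter, in line with (5.28).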

A full development of information measures and hierarchical models is given 
by Goel and DeGroot (1979, 1981); see also Problems 5.16-5.19. 

Theorem 5.7 shows how information acts within the levels of a hierarchy, but
does not address the, perhaps, more basic question of assessing the information
provided by a prior distribution in a particular model. Information measures, such
as $K[f, g]$, can also be the basis of answering this latter question. If $X \sim f(x|\theta)$
and $\Theta \sim \pi(\theta)$, then prior distributions that have a large effect on $\pi(\theta|x)$ should
produce small values of $K[\pi(\theta|x), \pi(\theta)]$ since the prior and posterior distributions
will be close together. Alternatively, prior distributions that have a small effect
on $\pi(\theta|x)$ should produce large values of $K[\pi(\theta|x), \pi(\theta)]$, as the posterior will
mainly reflect the sampling density. Thus, we may seek to find a prior $\pi(\theta)$ that
produces the maximum value of $K[\pi(\theta|x), \pi(\theta)]$. We can consider such a prior to
have the least influence on $f(x|\theta)$ and, hence, to be a default, or noninformative,
prior.

The above is an informal description of the approach to the construction of a
reference prior, initiated by Bernardo (1979) and further developed and formalized
by Berger and Bernardo (1989, 1992). [See also Robert 1994a, Section 3.4.] This
theory is quite involved, but approximations due to Clarke and Barron (1990, 1994)
and Clarke and Wasserman (1993) shed some interesting light on the problem. First,
we cannot directly use $K[\pi(\theta|x), \pi(\theta)]$ to derive a prior distribution, because it is
a function of $x$. We, thus, consider its expected value with respect to the marginal
distribution of $X$, the Shannon information

(5.34)  $S(\pi) = \int K[\pi(\theta|x), \pi(\theta)]\, m_\pi(x)\,dx$,

where $m_\pi(x) = \int f(x|\theta)\,\pi(\theta)\,d\theta$ is the marginal distribution. The reference prior
is the distribution that maximizes $S(\pi)$.

The following theorem is due to Clarke and Barron (1990). 

Theorem 5.8 Let $X_1, \ldots, X_n$ be an iid sample from $f(x|\theta)$, and let $S_n(\pi)$ denote
the Shannon information of the sample. Then, as $n \to \infty$,

(5.35)  $S_n(\pi) = \frac{k}{2}\,\log\frac{n}{2\pi e} + \int \pi(\theta)\,\log\frac{|I(\theta)|^{1/2}}{\pi(\theta)}\,d\theta + o(1)$

where $k$ is the dimension of $\theta$ and $I(\theta)$ is the Fisher information

    $I(\theta) = -E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right]$.

As the integral in the expansion (5.35) is the only term involving the prior $\pi(\theta)$,
maximizing that integral will maximize the expansion. Provided that $|I(\theta)|^{1/2}$ is
integrable, Corollary 1.7.6 shows that $\pi(\theta) = |I(\theta)|^{1/2}$ is the appropriate choice.
This is the Jeffreys prior, which was discussed in Section 4.1.


Example 5.9 Binomial reference prior. For $X_1, \ldots, X_n$ iid as Bernoulli($\theta$), we
have

(5.36)  $I(\theta) = -E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right] = \frac{n}{\theta(1 - \theta)}$,

which yields the Jeffreys prior $\pi(\theta) \propto [\theta(1 - \theta)]^{-1/2}$. This is also the prior that
maximizes the integral in $S_n(\pi)$ and, in that sense, imparts the least information
on $f(x|\theta)$. A formal reference prior derivation also shows that the Jeffreys prior is
the reference prior. ‖
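For readers who want the intermediate steps behind (5.36), the calculation can be written out as follows. This is a routine derivation from the Bernoulli likelihood; the final remark about the normalizing constant uses the standard value of the beta function $B(1/2, 1/2) = \pi$ (here $\pi$ denotes the number, not the prior).

```latex
\begin{align*}
\log f(x|\theta) &= \Big(\sum_{i=1}^n x_i\Big)\log\theta
                   + \Big(n - \sum_{i=1}^n x_i\Big)\log(1-\theta), \\
\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)
  &= -\frac{\sum_{i=1}^n x_i}{\theta^2}
     - \frac{n - \sum_{i=1}^n x_i}{(1-\theta)^2}, \\
I(\theta) &= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2}
           = \frac{n}{\theta} + \frac{n}{1-\theta}
           = \frac{n}{\theta(1-\theta)} .
\end{align*}
```

Since $\int_0^1 [\theta(1-\theta)]^{-1/2}\,d\theta = B(1/2, 1/2)$ is finite, the Jeffreys prior in this example is the proper beta$(1/2, 1/2)$ distribution.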


In problems where there are no nuisance parameters, the Jeffreys and reference
priors agree, even when they are improper. In fact, the reference prior approach
was developed to deal with the nuisance parameter problem, as the Fisher information
approach gave no clear-cut guidelines as to how to proceed in that case.
Reference prior derivations for nuisance parameter problems are given by Berger
and Bernardo (1989, 1992a, 1992b) and Polson and Wasserman (1990). See also
Clarke and Wasserman (1993) for an expansion similar to (5.35) that is valid in
the nuisance parameter case.


6 Empirical Bayes 

Another generalization of single-prior Bayes estimation, empirical Bayes estimation,
falls outside of the formal Bayesian paradigm. However, it has proven to
be an effective technique of constructing estimators that perform well under both
Bayesian and frequentist criteria. One reason for this, as we will see, is that empirical
Bayes estimators tend to be more robust against misspecification of the prior
distribution.

The starting point is again the model (3.1), but we now treat $\gamma$ as an unknown
parameter of the model, which also needs to be estimated. Thus, we now have two
parameters to estimate, necessitating at least two observations. We begin with the
Bayes model

(6.1)   $X_i|\theta \sim f(x|\theta)$,  $i = 1, \ldots, n$,
        $\Theta|\gamma \sim \pi(\theta|\gamma)$,

and calculate the marginal distribution of $X$, with density

(6.2)   $m(x|\gamma) = \int \prod_i f(x_i|\theta)\,\pi(\theta|\gamma)\,d\theta$.

Based on $m(x|\gamma)$, we obtain an estimate, $\hat\gamma(x)$, of $\gamma$. It is most common to take
$\hat\gamma(x)$ to be the MLE of $\gamma$, but this is not essential. We now substitute $\hat\gamma(x)$ for $\gamma$ in
$\pi(\theta|\gamma)$ and determine the estimator that minimizes the empirical posterior loss

(6.3)   $\int L(\theta, \delta(x))\,\pi(\theta|x, \hat\gamma(x))\,d\theta$.

This minimizing estimator is the empirical Bayes estimator.

An alternative definition is obtained by substituting $\hat\gamma(x)$ for $\gamma$ in the Bayes
estimator. Although, mathematically, this is equivalent to the definition given here
(see Problem 6.1), it is statistically more satisfying to define the empirical Bayes
estimator as minimizing the empirical posterior loss (6.3).

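Example 6.1 Normal empirical Bayes. To calculate an empirical Bayes estimator
for the model (5.7) of Example 5.2, rather than integrate over the prior for $\tau^2$,
we estimate $\tau^2$. We determine the marginal distribution of $X$ (see Problem 6.4),

(6.4)   $m(x|\tau^2) = \int \prod_{i=1}^{n} f(x_i|\theta)\,\pi(\theta|\tau^2)\,d\theta$
        $= \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum (x_i - \bar x)^2}\, \frac{1}{(2\pi\tau^2)^{1/2}} \int_{-\infty}^{\infty} e^{-\frac{n}{2\sigma^2}(\bar x - \theta)^2}\, e^{-\frac{\theta^2}{2\tau^2}}\,d\theta$
        $= \frac{1}{(2\pi)^{n/2}\sigma^{n}} \left(\frac{\sigma^2}{\sigma^2 + n\tau^2}\right)^{1/2} e^{-\frac{1}{2}\left[\frac{\sum (x_i - \bar x)^2}{\sigma^2} + \frac{n\bar x^2}{\sigma^2 + n\tau^2}\right]}$.

(Note the similarity to the density (2.13) in the one-way random effects model.)
From this density, we can now estimate $\tau^2$ using maximum likelihood (or some
other estimation method). Recalling that we are assuming $\sigma^2$ is known, we find
the MLE of $\sigma^2 + n\tau^2$ given by $\widehat{\sigma^2 + n\tau^2} = \max\{\sigma^2, n\bar x^2\}$. Substituting into the
single-prior Bayes estimator, we obtain the empirical Bayes estimator

(6.5)   $E(\Theta|\bar x, \hat\tau) = \left(1 - \frac{\sigma^2}{\sigma^2 + n\hat\tau^2}\right)\bar x = \left(1 - \frac{\sigma^2}{\max\{\sigma^2, n\bar x^2\}}\right)\bar x$.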
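A short computational sketch of Example 6.1 may help fix ideas. It evaluates the log of the marginal density (6.4) as a function of $\tau^2$ and computes the estimator (6.5); the function names and the two small data sets are illustrative assumptions.

```python
import math

def log_marginal(xs, sigma2, tau2):
    """Log of the marginal density (6.4), viewed as a function of tau^2."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    s = sigma2 + n * tau2
    return (-0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * math.log(sigma2)
            + 0.5 * math.log(sigma2 / s)
            - 0.5 * (ss / sigma2 + n * xbar ** 2 / s))

def normal_empirical_bayes(xs, sigma2):
    """Empirical Bayes estimator (6.5) with sigma^2 known.

    Returns the estimate and the implied MLE of tau^2.
    """
    n = len(xs)
    xbar = sum(xs) / n
    s_hat = max(sigma2, n * xbar ** 2)   # MLE of sigma^2 + n * tau^2
    tau2_hat = (s_hat - sigma2) / n      # implied MLE of tau^2 (never negative)
    return (1.0 - sigma2 / s_hat) * xbar, tau2_hat

est, tau2_hat = normal_empirical_bayes([2.0, 2.0, 2.0, 2.0], sigma2=1.0)
est_small, tau2_small = normal_empirical_bayes([0.1, -0.1, 0.2, 0.0], sigma2=1.0)
print(est, tau2_hat)          # xbar = 2 is shrunk slightly toward 0
print(est_small, tau2_small)  # n * xbar^2 < sigma^2, so the estimate is 0
```

The second data set shows the truncation in (6.5) at work: when $n\bar x^2 \le \sigma^2$, the estimated prior variance is zero and the estimator collapses to the prior mean.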


It is tempting to ask whether the empirical Bayes estimator is ever a Bayes
estimator; that is, can we consider $\pi(\theta|x, \hat\gamma(x))$ to be a “legitimate” posterior
density, in that it can be derived from a real prior distribution? The answer is yes, but
the prior distribution that leads to such a posterior may sometimes not be proper 
(see Problem 6.2). 

We next consider an example that illustrates the type of situation where empirical 
Bayes estimation is particularly useful. 

Example 6.2 Empirical Bayes binomial. Empirical Bayes estimation is best 
suited to situations in which there are many problems that can be modeled si¬ 
multaneously in a common way. For example, suppose that there are K different 
groups of patients, where each group has n patients. Each group is given a different 
treatment for the same illness, and in the kth group, we count X^,k = 1...., K, the 
number of successful treatments out of n. Since the groups receive different treat¬ 
ments, we expect different success rates; however, since we are treating the same 
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illness, these rates should be somewhat related to each other. These considerations 
suggest the hierarchy 

(6.6) X)■ ~ binomial! p k , n), 

p k ~ beta(a, b ), k = 1, ... ■, K, 


where the K groups are tied together by the common prior distribution. As in 
Example 1.5, the single-prior Bayes estimator of pk under squared error loss is 

(6.7) ^(JCn) = E{p k \x k , a , b) = ° + ** . 

a + b + n 

In Example 1.5, a and b are assumed known and all calculations are straightfor¬ 
ward. In the empirical Bayes model, however, we consider these hyperparameters 
unknown and estimate them. To construct an empirical Bayes estimator, we first 
calculate the marginal distribution 


(6.8)    $m(\mathbf{x} \mid a, b) = \int_0^1 \cdots \int_0^1 \prod_{k=1}^{K} \binom{n}{x_k} p_k^{x_k} (1 - p_k)^{n - x_k} \, \dfrac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \, p_k^{a-1} (1 - p_k)^{b-1} \, dp_k$

         $= \prod_{k=1}^{K} \binom{n}{x_k} \dfrac{\Gamma(a + b)\,\Gamma(a + x_k)\,\Gamma(n - x_k + b)}{\Gamma(a)\,\Gamma(b)\,\Gamma(a + b + n)}$,

a product of beta-binomial distributions. We now proceed with maximum likelihood
estimation of $a$ and $b$ based on (6.8). Although the MLEs $\hat a$ and $\hat b$ are not
expressible in closed form, we can calculate them numerically and construct the
empirical Bayes estimator

(6.9)    $\delta^{\hat\pi}(x_k) = E(p_k \mid x_k, \hat a, \hat b) = \dfrac{\hat a + x_k}{\hat a + \hat b + n}$.

The Bayes risk of $E(p_k \mid x_k, \hat a, \hat b)$ is only slightly higher than that of the Bayes
estimator (6.7), and is given in Table 6.1. For comparison, we also include the
Bayes risk of the unbiased estimator $x/n$. The first three rows correspond to a
prior mean of 1/2, with decreasing prior variance. Notice how the risk of the
empirical Bayes estimator is between that of the Bayes estimator and that of $X/n$.
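
The construction of Example 6.2 can be sketched numerically as follows. This is only an illustration under stated assumptions, not the computation behind Table 6.1: the function names, the crude coordinate search on $(\log a, \log b)$, and the data in `counts` are all hypothetical, and a careful fit would use a proper optimizer.

```python
import math

def log_marginal(a, b, xs, n):
    """Log of (6.8), dropping the binomial coefficients (constant in a and b)."""
    const = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             - math.lgamma(a + b + n))
    return sum(const + math.lgamma(a + x) + math.lgamma(n - x + b) for x in xs)

def fit_hyperparameters(xs, n, steps=200):
    """Crude coordinate search on (log a, log b), started at a = b = 1.

    An assumption-laden sketch: it only ever moves uphill on (6.8) and
    halves its step when no coordinate move improves the likelihood.
    """
    la, lb, step = 0.0, 0.0, 1.0
    best = log_marginal(math.exp(la), math.exp(lb), xs, n)
    for _ in range(steps):
        improved = False
        for da, db in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            val = log_marginal(math.exp(la + da), math.exp(lb + db), xs, n)
            if val > best:
                la, lb, best, improved = la + da, lb + db, val, True
        if not improved:
            step /= 2.0
    return math.exp(la), math.exp(lb)

def empirical_bayes(xs, n):
    """Plug the marginal MLEs into the Bayes estimator, as in (6.9)."""
    a_hat, b_hat = fit_hyperparameters(xs, n)
    return [(a_hat + x) / (a_hat + b_hat + n) for x in xs]

# Hypothetical data: successes out of n = 20 in K = 10 groups.
counts = [3, 8, 12, 15, 5, 10, 17, 9, 11, 6]
estimates = empirical_bayes(counts, 20)
```

Each estimate is a convex combination of the raw proportion $x_k/n$ and the estimated prior mean $\hat a/(\hat a + \hat b)$, which is the shrinkage visible in (6.9).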


As Example 6.2 illustrates, and as we will see later in this chapter (Section 7), 
the Bayes risk performance of the empirical Bayes estimator is often “robust”; that 
is, its Bayes risk is reasonably close to that of the Bayes estimator no matter what 
values the hyperparameters attain. 

We next turn to the case of exponential families, and find that a number of 
the expressions developed in Section 3 are useful in evaluating empirical Bayes 
estimators. In particular, we find an interesting representation for the risk under 
squared error loss. 





Table 6.1. Bayes Risks for the Bayes, Empirical Bayes, and Unbiased Estimators of Example
6.2, where $K = 10$ and $n = 20$

  Prior Parameters                          Bayes Risk
    a      b      $\delta^{\pi}$ of (6.7)    $\delta^{\hat\pi}$ of (6.9)    $x/n$
    2      2           .0833                    .0850                 .1000
    6      6           .0721                    .0726                 .1154
   20     20           .0407                    .0407                 .1220
    3      1           .0625                    .0641                 .0750
    9      3           .0541                    .0565                 .0865
   30     10           .0305                    .0326                 .0915


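For the situation of Corollary 3.3, using a prior $\pi(\eta \mid \lambda)$, where $\lambda$ is a
hyperparameter, the Bayes estimator of (3.12) becomes

(6.10)   $E(\eta_i \mid \mathbf{x}, \lambda) = \dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \lambda) - \dfrac{\partial}{\partial x_i} \log h(\mathbf{x})$,

where $m(\mathbf{x} \mid \lambda) = \int p_\eta(\mathbf{x}) \pi(\eta \mid \lambda) \, d\eta$ is the marginal distribution. Simply substituting
an estimate of $\lambda$, $\hat\lambda(\mathbf{x})$, into (6.10) yields the empirical Bayes estimator

(6.11)   $E(\eta_i \mid \mathbf{x}, \hat\lambda) = \left. \dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \lambda) \right|_{\lambda = \hat\lambda(\mathbf{x})} - \dfrac{\partial}{\partial x_i} \log h(\mathbf{x})$.

If $\hat\lambda$ is, in fact, the MLE of $\lambda$ based on $m(\mathbf{x} \mid \lambda)$, then the empirical Bayes estimator
has an alternate representation.

Theorem 6.3 For the situation of Corollary 3.3, with prior distribution $\pi(\eta \mid \lambda)$,
suppose $\hat\lambda(\mathbf{x})$ is the MLE of $\lambda$ based on $m(\mathbf{x} \mid \lambda)$. Then, the empirical Bayes estimator
is

(6.12)   $E(\eta_i \mid \mathbf{x}, \hat\lambda) = \dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \hat\lambda(\mathbf{x})) - \dfrac{\partial}{\partial x_i} \log h(\mathbf{x})$.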


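Proof. Recall from calculus that if $f(\cdot, \cdot)$ and $g(\cdot)$ are differentiable functions,
then

(6.13)   $\dfrac{d}{dx} f(x, g(x)) = g'(x) \left. \dfrac{\partial}{\partial y} f(x, y) \right|_{y = g(x)} + \left. \dfrac{\partial}{\partial x} f(x, y) \right|_{y = g(x)}$.

Applying this to $m(\mathbf{x} \mid \hat\lambda(\mathbf{x}))$ shows that

$\dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \hat\lambda(\mathbf{x})) = \dfrac{\partial}{\partial x_i} \hat\lambda(\mathbf{x}) \left. \dfrac{\partial}{\partial \lambda} \log m(\mathbf{x} \mid \lambda) \right|_{\lambda = \hat\lambda(\mathbf{x})} + \left. \dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \lambda) \right|_{\lambda = \hat\lambda(\mathbf{x})}$

$= \left. \dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \lambda) \right|_{\lambda = \hat\lambda(\mathbf{x})}$

because $(\partial/\partial\lambda) \log m(\mathbf{x} \mid \lambda)$ is zero at $\lambda = \hat\lambda(\mathbf{x})$. Hence, the empirical Bayes
estimator is equal to (6.12). □

Thus, for estimating the natural parameter of an exponential family, the empirical
Bayes estimator (using the marginal MLE) can be expressed in the same form
as a formal Bayes estimator. Here we use the adjective formal to signify a
mathematical equivalence, as the function $m(\mathbf{x} \mid \hat\lambda(\mathbf{x}))$ may not correspond to a proper
marginal density. See Bock (1988) for some interesting results and variations on
these estimators.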


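Example 6.4 Normal empirical Bayes, $\mu$ unknown. Consider the estimation of
$\theta_i$ in the model of Example 3.4,

(6.14)   $X_i \mid \theta_i \sim N(\theta_i, \sigma^2)$,  $i = 1, \ldots, p$, independent,

(6.15)   $\Theta_i \sim N(\mu, \tau^2)$,  $i = 1, \ldots, p$, independent,

where $\mu$ is unknown. We can use Theorem 6.3 to calculate the empirical Bayes
estimator, giving

$E(\Theta_i \mid \mathbf{x}, \hat\mu) = \sigma^2 \left[ \dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \hat\mu) - \dfrac{\partial}{\partial x_i} \log h(\mathbf{x}) \right]$,

where $\hat\mu$ is the MLE of $\mu$ from

$m(\mathbf{x} \mid \mu) = \dfrac{1}{[2\pi(\sigma^2 + \tau^2)]^{p/2}} \exp\left\{ -\dfrac{1}{2(\sigma^2 + \tau^2)} \sum (x_i - \mu)^2 \right\}$.

Hence, $\hat\mu = \bar x$ and

$\dfrac{\partial}{\partial x_i} \log m(\mathbf{x} \mid \hat\mu) = \dfrac{\partial}{\partial x_i} \left[ \dfrac{-1}{2(\sigma^2 + \tau^2)} \sum (x_i - \bar x)^2 \right] = -\dfrac{x_i - \bar x}{\sigma^2 + \tau^2}$.

This yields the empirical Bayes estimator

$E(\Theta_i \mid \mathbf{x}, \hat\mu) = \dfrac{\tau^2}{\sigma^2 + \tau^2} x_i + \dfrac{\sigma^2}{\sigma^2 + \tau^2} \bar x$,

which is the Bayes estimator under the prior $\pi(\theta \mid \bar x)$.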
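
As a numerical check of Theorem 6.3 in the setting of Example 6.4, the sketch below differentiates $\log m(\mathbf{x} \mid \hat\mu(\mathbf{x}))$ by central differences, letting $\hat\mu(\mathbf{x}) = \bar x$ move with $\mathbf{x}$, and compares the result with the closed-form shrinkage estimator. The function names and the test vector are illustrative assumptions; the term $x_i/\sigma^2$ is $-\partial \log h(\mathbf{x})/\partial x_i$ for the normal density.

```python
import math

def log_m_profile(x, sigma2, tau2):
    """log m(x | mu_hat(x)) for the normal model, with mu_hat(x) = mean(x)."""
    p = len(x)
    mu_hat = sum(x) / p
    v = sigma2 + tau2
    return (-0.5 * p * math.log(2.0 * math.pi * v)
            - sum((xi - mu_hat) ** 2 for xi in x) / (2.0 * v))

def eb_via_theorem(x, i, sigma2, tau2, eps=1e-5):
    """Estimator of theta_i from (6.12), using a central finite difference."""
    up = list(x)
    up[i] += eps
    down = list(x)
    down[i] -= eps
    dlogm = (log_m_profile(up, sigma2, tau2)
             - log_m_profile(down, sigma2, tau2)) / (2.0 * eps)
    # -d/dx_i log h(x) = x_i / sigma^2; multiply by sigma^2 to pass from
    # the natural parameter eta_i = theta_i / sigma^2 back to theta_i.
    return sigma2 * (dlogm + x[i] / sigma2)

def eb_closed_form(x, i, sigma2, tau2):
    """tau^2/(sigma^2+tau^2) * x_i + sigma^2/(sigma^2+tau^2) * xbar."""
    xbar = sum(x) / len(x)
    w = tau2 / (sigma2 + tau2)
    return w * x[i] + (1.0 - w) * xbar

x = [1.0, 4.0, -2.0, 3.5]  # hypothetical observations
pairs = [(eb_via_theorem(x, i, 2.0, 3.0), eb_closed_form(x, i, 2.0, 3.0))
         for i in range(len(x))]
```

The two columns of `pairs` agree up to finite-difference rounding, which reflects the fact that the derivative through $\hat\mu(\mathbf{x})$ vanishes at the MLE.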

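An advantage of the form (6.12) is that it allows us to represent the risk of the
empirical Bayes estimator in the form specified by (3.13). The risk of the empirical
Bayes estimator (6.12) is given by

(6.16)   $R[\eta, E(\eta \mid \mathbf{X}, \hat\lambda(\mathbf{X}))] = R[\eta, -\nabla \log h(\mathbf{X})]$

         $+ \sum_{i=1}^{p} E_\eta \left\{ 2 \dfrac{\partial^2}{\partial X_i^2} \log m(\mathbf{X} \mid \hat\lambda(\mathbf{X})) + \left( \dfrac{\partial}{\partial X_i} \log m[\mathbf{X} \mid \hat\lambda(\mathbf{X})] \right)^2 \right\}$.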

Using the MLE $\hat\mu(\mathbf{x}) = \bar x$, differentiating the log of $m(\mathbf{x} \mid \hat\mu(\mathbf{x}))$, and substituting
into (6.12) shows (Problem 6.10) that

.2 


RlV, E{v\X, j±(X)}] = p/a z 

2 (p - l) 2 


P~ 1 


p(a 2 + r 2 ) p(o 2 + t 2 ) 2 “ 


E n (X t - X) 2 
Table 6.2. Values of the Hierarchical Bayes (HB) (5.8) and Empirical Bayes (EB) Estimate
(6.5).

                      Value of $x$
   $\nu$         .5       2        5        10
   2            .27     1.22     4.36     9.69
   10           .26     1.07     3.34     8.89
   30           .25     1.02     2.79     7.30
   $\infty$     .25     1.00     2.50     5.00
   EB           0       1.50     4.80     9.90


As mentioned at the beginning of this section, empirical Bayes estimators can 
also be useful as approximations to hierarchical Bayes estimators. Since we often 
have simpler expressions for the empirical Bayes estimator, if its behavior is close 
to that of the hierarchical Bayes estimator, it becomes a reasonable substitute (see, 
for example, Kass and Steffey 1989). 

Example 6.5 Hierarchical Bayes approximation. Both Examples 5.2 and 6.1 

consider the same model, where in Example 5.2 the hierarchical Bayes estimator 
(5.8) averages over the hyperparameter, and in Example 6.1 the empirical Bayes 
estimator (6.5) estimates the hyperparameter. A small numerical comparison in 
Table 6.2 suggests that the empirical Bayes estimator is a reasonable, but not 
exceptional, approximation to the hierarchical Bayes estimator. 

The approximation of hierarchical Bayes by empirical Bayes is best for small
values of $\nu$ [defined in (5.7)] and deteriorates as $\nu \to \infty$. At $\nu = \infty$, the hierarchical
Bayes estimator becomes a Bayes estimator under a $N(0, 1)$ prior (see Problem
6.11). Notice that, even though (6.5) provides us with a simple expression for an
estimator, it still requires some work to evaluate the mean squared error, or Bayes
risk, of (6.5). However, it is important to do so to obtain an overall picture of the
performance of the estimator (Problem 6.12).
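
Two rows of Table 6.2 can be reproduced with a few lines. The sketch below assumes that (6.5) has the positive-part form $(1 - 1/x^2)^+ x$ and that the sampling variance is 1; both are assumptions read off the table rather than statements taken from Examples 5.2 and 6.1, and the function names are hypothetical.

```python
def empirical_bayes_estimate(x):
    """Assumed form of (6.5): (1 - 1/x^2)^+ x, with unit sampling variance."""
    return max(0.0, 1.0 - 1.0 / x ** 2) * x

def bayes_estimate_unit_normal_prior(x):
    """Posterior mean for X | theta ~ N(theta, 1) and theta ~ N(0, 1): x / 2."""
    return x / 2.0

xs = [0.5, 2.0, 5.0, 10.0]
eb_row = [empirical_bayes_estimate(x) for x in xs]            # EB row
limit_row = [bayes_estimate_unit_normal_prior(x) for x in xs]  # nu = infinity row
```

Under these assumptions, `eb_row` and `limit_row` match the EB and $\nu = \infty$ rows of the table, which bracket the hierarchical Bayes values from opposite sides for large $x$.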


Although the (admittedly naive) approximation in Example 6.5 is not very accurate,
there are other situations where the empirical Bayes estimator, or slight
modifications thereof, can provide a good approximation to the hierarchical Bayes
estimator. We now look at some of these situations.

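For the general hierarchical model (5.1), the Bayes estimator under squared
error loss is

(6.17)   $E(\Theta \mid \mathbf{x}) = \int \theta \, \pi(\theta \mid \mathbf{x}) \, d\theta$,

which can be written

(6.18)   $E(\Theta \mid \mathbf{x}) = \int\!\!\int \theta \, \pi(\theta \mid \mathbf{x}, \gamma) \, \pi(\gamma \mid \mathbf{x}) \, d\gamma \, d\theta$

         $= \int E(\Theta \mid \mathbf{x}, \gamma) \, \pi(\gamma \mid \mathbf{x}) \, d\gamma$,

where

(6.19)   $\pi(\gamma \mid \mathbf{x}) = \dfrac{m(\mathbf{x} \mid \gamma) \, \psi(\gamma)}{m(\mathbf{x})}$

with

(6.20)   $m(\mathbf{x} \mid \gamma) = \int f(\mathbf{x} \mid \theta) \, \pi(\theta \mid \gamma) \, d\theta$,

         $m(\mathbf{x}) = \int m(\mathbf{x} \mid \gamma) \, \psi(\gamma) \, d\gamma$.

Now, suppose that $\pi(\gamma \mid \mathbf{x})$ is quite peaked around its mode, $\hat\gamma_\pi$. We might then
consider approximating $E(\Theta \mid \mathbf{x})$ by $E(\Theta \mid \mathbf{x}, \hat\gamma_\pi)$. Moreover, if $\psi(\gamma)$ is relatively
flat, as compared to $m(\mathbf{x} \mid \gamma)$, we would expect $\pi(\gamma \mid \mathbf{x}) \approx m(\mathbf{x} \mid \gamma)$ and $\hat\gamma_\pi \approx \hat\gamma$,
the marginal MLE. In such a case, $E(\Theta \mid \mathbf{x}, \hat\gamma_\pi)$ would be close to the empirical
Bayes estimator $E(\Theta \mid \mathbf{x}, \hat\gamma)$, and hence the empirical Bayes estimator is a good
approximation to the hierarchical Bayes estimator (Equation 5.4.2).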


Example 6.6 Poisson hierarchy. Although we might expect the empirical Bayes
and hierarchical Bayes estimators to be close if the hyperparameter has a flat-tailed
prior, they will, generally, not be equal unless that prior is improper. Consider the
model

(6.21)   $X_i \sim \text{Poisson}(\lambda_i)$,  $i = 1, \ldots, p$, independent,

         $\lambda_i \sim \text{Gamma}(a, b)$,  $i = 1, \ldots, p$, independent, $a$ known.

The marginal distribution of $X_i$ is

$m(x_i \mid b) = \int_0^\infty \dfrac{e^{-\lambda_i} \lambda_i^{x_i}}{x_i!} \, \dfrac{1}{\Gamma(a) b^a} \, \lambda_i^{a-1} e^{-\lambda_i/b} \, d\lambda_i$

$= \dfrac{\Gamma(x_i + a)}{x_i! \, \Gamma(a)} \, \dfrac{1}{b^a} \left( 1 + \dfrac{1}{b} \right)^{-(x_i + a)}$

$= \binom{x_i + a - 1}{a - 1} \left( \dfrac{b}{b + 1} \right)^{x_i} \left( \dfrac{1}{b + 1} \right)^{a}$,

a negative binomial distribution. Thus,

(6.22)   $m(\mathbf{x} \mid b) = \left[ \prod_{i=1}^{p} \binom{x_i + a - 1}{a - 1} \right] \left( \dfrac{b}{b + 1} \right)^{p\bar x} \left( \dfrac{1}{b + 1} \right)^{pa}$

and the marginal MLE of $b$ is $\hat b = \bar x / a$. From (6.21), the Bayes estimator is

(6.23)   $E(\lambda_i \mid x_i, b) = \dfrac{b}{b + 1} (a + x_i)$

and, hence, the empirical Bayes estimator is

(6.24)   $E(\lambda_i \mid x_i, \hat b) = \dfrac{\bar x}{\bar x + a} (a + x_i)$.

If we add a prior $\psi(b)$ to the hierarchy (6.21), the hierarchical Bayes estimator can
be written

(6.25)   $E(\lambda_i \mid \mathbf{x}) = \int E(\lambda_i \mid x_i, b) \, \pi(b \mid \mathbf{x}) \, db$,

where

(6.26)   $\pi(b \mid \mathbf{x}) = \dfrac{m(\mathbf{x} \mid b) \, \psi(b)}{\int_0^\infty m(\mathbf{x} \mid b) \, \psi(b) \, db}$.

From examination of the hierarchy (6.21), a choice of $\psi(b)$ might be an inverted
gamma, as this would be conjugate for $\lambda_i$. However, these priors will not lead to
a simple expression for $E(\lambda_i \mid \mathbf{x})$ (although they may lead to good estimators). In
general, however, we are less concerned that the hyperprior reflect reality (which
is a concern for the prior), since the hyperprior tends to have less influence on
our ultimate inference (Theorem 5.7). Thus, we will often base the choice of the
hyperprior on convenience.

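Let us, therefore, choose as prior for $b$ an $F$-distribution,

(6.27)   $\psi(b) \propto \dfrac{b^{\alpha - 1}}{(1 + b)^{\alpha + \beta}}$,

which is equivalent to putting a beta($\alpha$, $\beta$) prior on $b/(1 + b)$. The denominator of
$\pi(b \mid \mathbf{x})$ in (6.26) is

(6.28)   $\int_0^\infty \left( \dfrac{b}{b + 1} \right)^{p\bar x} \left( \dfrac{1}{b + 1} \right)^{pa} \dfrac{b^{\alpha - 1}}{(1 + b)^{\alpha + \beta}} \, db = \int_0^1 t^{p\bar x + \alpha - 1} (1 - t)^{pa + \beta - 1} \, dt$

         $= \dfrac{\Gamma(p\bar x + \alpha) \, \Gamma(pa + \beta)}{\Gamma(p\bar x + pa + \alpha + \beta)}$,

and (6.23), (6.26), and (6.28) lead to the hierarchical Bayes estimator

(6.29)   $E(\lambda_i \mid \mathbf{x}) = \int E(\lambda_i \mid x_i, b) \, \pi(b \mid \mathbf{x}) \, db$

         $= \dfrac{\Gamma(p\bar x + pa + \alpha + \beta)}{\Gamma(p\bar x + \alpha) \, \Gamma(pa + \beta)} \, (a + x_i) \int_0^\infty \left( \dfrac{b}{b + 1} \right)^{p\bar x + 1} \left( \dfrac{1}{b + 1} \right)^{pa} \dfrac{b^{\alpha - 1}}{(1 + b)^{\alpha + \beta}} \, db$

         $= \left[ \dfrac{\Gamma(p\bar x + pa + \alpha + \beta)}{\Gamma(p\bar x + \alpha) \, \Gamma(pa + \beta)} \right] \left[ \dfrac{\Gamma(p\bar x + \alpha + 1) \, \Gamma(pa + \beta)}{\Gamma(p\bar x + pa + \alpha + \beta + 1)} \right] (a + x_i)$

         $= \dfrac{p\bar x + \alpha}{p\bar x + pa + \alpha + \beta} \, (a + x_i)$.

The hierarchical Bayes estimator will therefore be equal to the empirical Bayes
estimator when $\alpha = \beta = 0$. This makes $\psi(b) \propto (1/b)$ an improper prior. However,
the calculation of $\pi(b \mid \mathbf{x})$ from (6.26) and $E(\lambda_i \mid \mathbf{x})$ from (6.29) will still be valid.
(This model was considered by Deely and Lindley (1981), who termed it Bayes
Empirical Bayes.)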

To further see how the empirical Bayes estimator is an approximation to the
hierarchical Bayes estimator, write

$\dfrac{p\bar x + \alpha}{p\bar x + pa + \alpha + \beta} = \dfrac{\bar x}{\bar x + a} - \dfrac{\beta}{p(\bar x + a)} + \dfrac{pa\alpha + pa\beta + \alpha\beta}{p^2 (\bar x + a)^2} - \dfrac{2a\alpha\beta}{p^2 (\bar x + a)^3} + \cdots$.

This shows that the empirical Bayes estimator is the leading term in a Taylor series
expansion of the hierarchical Bayes estimator, and we can write

(6.30)   $E(\lambda_i \mid \mathbf{x}) = E(\lambda_i \mid x_i, \hat b) + O(1/p)$.

Estimators of the form (6.29) are similar to those developed by Clevenson and
Zidek (1975) for estimation of Poisson means. The Clevenson-Zidek estimators,
which have a = 0 in (6.29), are minimax estimators of $\lambda$ (see Section 5.7).
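
The relationship between (6.24) and (6.29) in Example 6.6 is easy to check numerically. The sketch below is illustrative only: the function names and the counts are hypothetical, and `alpha`, `beta` stand for the hyperprior parameters $\alpha$ and $\beta$ of (6.27).

```python
def eb_poisson(x, a):
    """Empirical Bayes estimator (6.24): xbar / (xbar + a) * (a + x_i)."""
    xbar = sum(x) / len(x)
    return [xbar / (xbar + a) * (a + xi) for xi in x]

def hb_poisson(x, a, alpha, beta):
    """Hierarchical Bayes estimator (6.29) under the F-type hyperprior (6.27)."""
    p, total = len(x), sum(x)  # total = p * xbar
    shrink = (total + alpha) / (total + p * a + alpha + beta)
    return [shrink * (a + xi) for xi in x]

def max_gap(x, a, alpha, beta):
    """Largest coordinatewise difference between the two estimators."""
    return max(abs(u - v) for u, v in zip(eb_poisson(x, a),
                                          hb_poisson(x, a, alpha, beta)))

x = [3, 0, 5, 2, 4]                   # hypothetical counts, p = 5
gap_small_p = max_gap(x, 2.0, 1.0, 1.0)
gap_large_p = max_gap(x * 10, 2.0, 1.0, 1.0)   # same xbar, p = 50
gap_improper = max_gap(x, 2.0, 0.0, 0.0)       # alpha = beta = 0
```

With $\bar x$ held fixed, the gap shrinks as $p$ grows, in line with the $O(1/p)$ remainder in (6.30), and it vanishes when $\alpha = \beta = 0$.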


If interest centers on obtaining an approximation to a hierarchical Bayes
estimator, a more direct route would be to look for an accurate approximation to the
integral in (6.17). When such an approximation coincides with the empirical Bayes
estimator, we can safely consider the empirical Bayes estimator as an approximate
hierarchical Bayes estimator.


Example 6.7 Continuation of Example 5.2. In Example 5.2, the hierarchical
Bayes estimator (5.5.8) was approximated by the empirical Bayes estimator (5.6.5).
If, instead, we seek a direct approximation to (5.5.8), we might start with the Taylor
expansion of $(1 + \theta^2/\nu)^{-(\nu+1)/2}$ around $x$,

(6.31)   $\dfrac{1}{(1 + \theta^2/\nu)^{(\nu+1)/2}} = \dfrac{1}{(1 + x^2/\nu)^{(\nu+1)/2}} - \dfrac{\nu + 1}{\nu} \, \dfrac{x}{(1 + x^2/\nu)^{(\nu+3)/2}} \, (\theta - x) + O[(\theta - x)^2]$,

and using this in the numerator and denominator of (5.5.8) yields the approximation
(Problem 6.15)


(6.32) 


E(0|x) 


( l (v+v \ 

V p(v + x 2 )J 


x + O 



Notice that the approximation is equal to the empirical Bayes estimator if $\nu = 0$, an
extremely flat prior! The approximation (6.32) is better than the empirical Bayes
estimator for large values of $\nu$, but worse for small values of $\nu$.


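The approximation (6.32) is a special case of a Laplace approximation (Tierney
and Kadane 1986). The idea behind the approximation is to carry out a Taylor
series expansion of the integrand around an MLE, which can be summarized as

(6.33)   $\int b(\lambda) e^{-nh(\lambda)} \, d\lambda \approx b(\hat\lambda) \sqrt{\dfrac{2\pi}{n h''(\hat\lambda)}} \, e^{-nh(\hat\lambda)}$.

Here, $h(\hat\lambda)$ is the unique minimum of $h(\lambda)$; that is, $\hat\lambda$ is the MLE based on a
likelihood proportional to $e^{-nh(\lambda)}$. (See Problem 6.17 for details.) In applying
(6.33) to a representation like (6.18), we obtain

(6.34)   $E(\Theta \mid \mathbf{x}) = \int E(\Theta \mid \mathbf{x}, \lambda) \, \pi(\lambda \mid \mathbf{x}) \, d\lambda$

         $= \int E(\Theta \mid \mathbf{x}, \lambda) \, e^{n [\frac{1}{n} \log \pi(\lambda \mid \mathbf{x})]} \, d\lambda$

         $\approx \left[ \dfrac{\sqrt{2\pi} \, \pi(\hat\lambda \mid \mathbf{x})}{\left( -\frac{\partial^2}{\partial\lambda^2} \log \pi(\lambda \mid \mathbf{x}) \big|_{\lambda = \hat\lambda} \right)^{1/2}} \right] E(\Theta \mid \mathbf{x}, \hat\lambda)$,

where $\hat\lambda$ is the mode of $\pi(\lambda \mid \mathbf{x})$. Thus, $E(\Theta \mid \mathbf{x}, \hat\lambda)$ in (6.34) will be the empirical
Bayes estimator if $\pi(\lambda \mid \mathbf{x}) \propto m(\mathbf{x} \mid \lambda)$, that is, if $\psi(\lambda) = 1$. Moreover, the expression
in square brackets in (6.34) is equal to 1 if $\pi(\lambda \mid \mathbf{x})$ is normal with mean $\hat\lambda$ and
variance equal to the inverse of the observed Fisher information (see Problem
6.17).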
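
The quality of a Laplace approximation of the form (6.33) can be examined on an integral with a known value. The sketch below takes $b(\lambda) = 1$ and $h(\lambda) = \lambda - \log\lambda$, a choice made here purely for illustration (it is not an example from the text), so that $\hat\lambda = 1$, $h''(\hat\lambda) = 1$, and the exact integral is $\Gamma(n+1)/n^{n+1}$.

```python
import math

def laplace_log_approx(n):
    """Log of the right side of (6.33) for b = 1, h(lam) = lam - log(lam)."""
    lam_hat = 1.0                       # minimizer of h
    h_min = lam_hat - math.log(lam_hat)  # h(lam_hat) = 1
    h_second = 1.0 / lam_hat ** 2        # h''(lam_hat) = 1
    return 0.5 * math.log(2.0 * math.pi / (n * h_second)) - n * h_min

def exact_log_integral(n):
    """Log of the integral of lam^n exp(-n lam) over (0, inf)."""
    return math.lgamma(n + 1) - (n + 1) * math.log(n)

def ratio(n):
    """Exact value divided by the Laplace approximation."""
    return math.exp(exact_log_integral(n) - laplace_log_approx(n))

ratios = {n: ratio(n) for n in (5, 50, 500)}
```

For this choice of $h$, the approximation is Stirling's formula, so the ratios approach 1 at rate $1/n$.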

Both the hierarchical and empirical Bayes approach are generalizations of
single-prior Bayes analysis. In each case, we generalize the single prior to a class of priors.
Hierarchical Bayes then averages over this class, whereas empirical Bayes chooses
a representative member. Moreover, we have considered the functional forms of
the prior distribution to be known; that is, even though $\theta$ and $\gamma$ are unknown,
$\pi(\theta \mid \gamma)$ and $\pi(\gamma)$ are known.

Another generalization of single-prior Bayes analysis is robust Bayes analysis,
where the class of priors is treated differently. Rather than summarize over the
class, we allow the prior distribution to vary through it, and examine the behavior
of the Bayes procedures as the prior varies. Moreover, the assumption of knowledge
of the functional form is relaxed. Typically, a hierarchy like (3.1) is used, and a
class of distributions for $\pi(\cdot \mid \cdot)$ is specified. For example, a popular class of prior
distributions for $\Theta$ is given by an $\varepsilon$-contamination class

(6.35)   $\Pi = \{ \pi(\theta \mid \lambda) : \pi(\theta \mid \lambda) = (1 - \varepsilon) \pi_0(\theta \mid \lambda) + \varepsilon q(\theta), \ q \in Q \}$,

where $\pi_0(\theta \mid \lambda)$ is a specified prior (sometimes called the root prior) and $q$ is
any distribution in a class $Q$. [Here, $Q$ is sometimes taken to be the class of
all distributions, but more restrictive classes can often provide estimators and
posterior distributions with desirable properties. See, for example, Berger and
Berliner 1986. Also, Mattner (1994) showed that for densities specified in the form
of $\varepsilon$-contamination classes, the order statistics are complete. See Note 1.10.5.]

Using (6.35), we then proceed in a formal Bayesian way, and derive estimators
based on minimizing posterior expected loss resulting from a prior $\pi \in \Pi$, say
$\pi^*$. The resulting estimator, say $\delta^{\pi^*}$, is evaluated using measures that range over
all $\pi \in \Pi$, to assess the robustness of $\delta^{\pi^*}$ against misspecification of the prior.
For example, one might consider robustness using the posterior expected loss, or
robustness using the Bayes risk. In this latter case, we might look at (Berger 1985,
Section 4.7.5)

(6.36)   $\sup_{\pi \in \Pi} r(\pi, \delta)$

and, perhaps, choose an estimator $\delta$ that minimizes this quantity. If the loss is
squared error, then for any estimator $\delta$, we can write (Problem 6.2)

(6.37)   $r(\pi, \delta) = r(\pi, \delta^{\pi}) + E(\delta - \delta^{\pi})^2$,

where $\delta^{\pi}$ is the Bayes estimator under $\pi$. From (6.37), we see that a robust Bayes
estimator is one that is "close" to the Bayes estimators for all $\pi \in \Pi$. An ultimate
goal of robust Bayes analysis is to find a prior $\pi^* \in \Pi$ for which $\delta^{\pi^*}$ can be
considered to be robust.

Example 6.8 Continuation of Example 3.1. To obtain a robust Bayes estimator
of $\theta$, consider the class of priors

(6.38)   $\Pi = \{ \pi : \pi(\theta) = (1 - \varepsilon) \pi_0(\theta \mid \tau_0) + \varepsilon q(\theta) \}$,

where $\pi_0 = N(\theta, \tau_0^2)$, $\tau_0$ is specified, and $q(\theta) = \int \pi(\theta \mid \tau^2) \pi(\tau^2 \mid a, b) \, d\tau^2$, as in
Problem 6.3(a). The posterior density corresponding to a distribution $\pi \in \Pi$ is
given by

(6.39)   $\pi(\theta \mid x) = \lambda(x) \pi_0(\theta \mid x, \tau_0) + (1 - \lambda(x)) q(\theta \mid x, a, b)$,

where $\lambda(x)$ is given by

(6.40)   $\lambda(x) = \dfrac{(1 - \varepsilon) m_{\pi_0}(x \mid \tau_0)}{(1 - \varepsilon) m_{\pi_0}(x \mid \tau_0) + \varepsilon m_q(x \mid a, b)}$

(see Problem 5.3). Using (6.39) and (6.40), the Bayes estimator for $\theta$ under squared
error loss is

(6.41)   $E(\Theta \mid x, \tau_0, a, b) = \lambda(x) E(\Theta \mid x, \tau_0) + (1 - \lambda(x)) E(\Theta \mid x, a, b)$,

a convex combination of the single-prior and hierarchical Bayes estimators, with
the weights dependent on the marginal distribution. A robust Bayes analysis would
proceed to evaluate the behavior (i.e., robustness) of this estimator as $\pi$ ranges
through $\Pi$.


7 Risk Comparisons 

In this concluding section, we look, in somewhat more detail, at the Bayes risk 
performance of some Bayes, empirical Bayes, and hierarchical Bayes estimators. 
We will also examine these risks under different prior assumptions, in the spirit of 
robust Bayes analysis. 

Example 7.1 The James-Stein estimator. Let $\mathbf{X}$ have a $p$-variate normal distribution
with mean $\theta$ and covariance matrix $\sigma^2 I$, where $\sigma^2$ is known; $\mathbf{X} \sim N_p(\theta, \sigma^2 I)$.
We want to estimate $\theta$ under sum-of-squared-errors loss

$L[\theta, \delta(\mathbf{x})] = |\theta - \delta(\mathbf{x})|^2 = \sum_{i=1}^{p} (\theta_i - \delta_i(\mathbf{x}))^2$,

using a prior distribution $\Theta \sim N(0, \tau^2 I)$, where $\tau^2$ is assumed to be known.

The Bayes estimator of $\theta$ is $\delta^{\tau}(\mathbf{x}) = [\tau^2/(\sigma^2 + \tau^2)]\mathbf{x}$, the vector of
componentwise Bayes estimates. It is straightforward to calculate its Bayes risk

(7.1)    $r(\tau, \delta^{\tau}) = \dfrac{p \sigma^2 \tau^2}{\sigma^2 + \tau^2}$.

An empirical Bayes approach to this problem would replace $\tau^2$ with an estimate
from the marginal distribution of $\mathbf{x}$,

(7.2)    $m(\mathbf{x} \mid \tau^2) = \dfrac{1}{[2\pi(\sigma^2 + \tau^2)]^{p/2}} \exp\left\{ -\dfrac{1}{2(\sigma^2 + \tau^2)} \sum x_i^2 \right\}$.

Although, for the most part, we have used maximum likelihood to estimate the
hyperparameters in empirical Bayes estimators, unbiased estimation provides an
alternative. Using the unbiased estimator of $\tau^2/(\sigma^2 + \tau^2)$, the empirical Bayes
estimator is (Problem 7.1)

(7.3)    $\delta^{JS}(\mathbf{x}) = \left[ 1 - \dfrac{(p - 2)\sigma^2}{|\mathbf{x}|^2} \right] \mathbf{x}$,

the James-Stein estimator.

This estimator was discovered by Stein (1956b) and later shown by James and
Stein (1961) to have a smaller mean squared error than the maximum likelihood
estimator $\mathbf{X}$ for all $\theta$. Its empirical Bayes derivation can be found in Efron and
Morris (1972a).

Since the James-Stein estimator (or any empirical Bayes estimator) cannot attain
as small a Bayes risk as the Bayes estimator, it is of interest to see how much larger
its Bayes risk $r(\tau, \delta^{JS})$ will be. This, in effect, tells us the penalty we are paying
for estimating $\tau^2$.

As a first step, we must calculate $r(\tau, \delta^{JS})$, which is made easier by first obtaining
an unbiased estimate of the risk $R(\theta, \delta^{JS})$. The integration over $\theta$ then becomes
simple, since the integrand becomes constant in $\theta$.

Recall Theorem 3.5, which gave an expression for the risk of a Bayes estimator
of the form (3.3.12). In the normal case, we can apply the theorem to a fairly wide
class of estimators to get an unbiased estimator of the risk.

Corollary 7.2 Let $\mathbf{X} \sim N_p(\theta, \sigma^2 I)$, and let the estimator $\delta$ be of the form

$\delta(\mathbf{x}) = \mathbf{x} - g(\mathbf{x})$,

where $g(\mathbf{x}) = \{g_i(\mathbf{x})\}$ is differentiable. If $E_\theta |(\partial/\partial X_i) g_i(\mathbf{X})| < \infty$ for $i = 1, \ldots, p$,
then

(7.4)    $R(\theta, \delta) = E_\theta |\theta - \delta(\mathbf{X})|^2$

         $= p\sigma^2 + E_\theta |g(\mathbf{X})|^2 - 2\sigma^2 \sum_{i=1}^{p} E_\theta \dfrac{\partial}{\partial X_i} g_i(\mathbf{X})$.

Hence,

(7.5)    $\hat R(\delta(\mathbf{x})) = p\sigma^2 + |g(\mathbf{x})|^2 - 2\sigma^2 \sum_{i=1}^{p} \dfrac{\partial}{\partial x_i} g_i(\mathbf{x})$

is an unbiased estimator of the risk $R(\theta, \delta)$.

Proof. In the notation of Theorem 3.5, in the normal case $-\partial/\partial x_i \log h(\mathbf{x}) =
x_i/\sigma^2$, and the result now follows by identifying $g(\mathbf{x})$ with $\nabla \log m(\mathbf{x})$, and some
calculation. See Problem 7.2. □

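For the James-Stein estimator (7.3), we have $g(\mathbf{x}) = (p - 2)\sigma^2 \mathbf{x}/|\mathbf{x}|^2$; hence,

(7.6)    $R(\theta, \delta^{JS}) = p\sigma^2 + E_\theta \left[ \dfrac{(p - 2)^2 \sigma^4}{|\mathbf{X}|^2} \right] - 2\sigma^2 \sum_{i=1}^{p} E_\theta \left\{ \dfrac{\partial}{\partial X_i} \left[ \dfrac{(p - 2)\sigma^2 X_i}{|\mathbf{X}|^2} \right] \right\}$

         $= p\sigma^2 + (p - 2)^2 \sigma^4 E_\theta \dfrac{1}{|\mathbf{X}|^2} - 2(p - 2)\sigma^4 \sum_{i=1}^{p} E_\theta \left[ \dfrac{|\mathbf{X}|^2 - 2X_i^2}{|\mathbf{X}|^4} \right]$

         $= p\sigma^2 - (p - 2)^2 \sigma^4 E_\theta \dfrac{1}{|\mathbf{X}|^2}$,

so $\hat R(\delta^{JS}(\mathbf{x})) = p\sigma^2 - (p - 2)^2 \sigma^4 / |\mathbf{x}|^2$.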
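
A small Monte Carlo experiment illustrates (7.3) and the unbiased risk estimate that follows from (7.6). The sketch below fixes $\sigma^2 = 1$ and a particular $\theta$; the function names, seed, and replication count are illustrative assumptions, and the simulation is only a rough check, not a substitute for the derivation.

```python
import random

def james_stein(x, sigma2=1.0):
    """James-Stein estimator (7.3): [1 - (p - 2) sigma^2 / |x|^2] x."""
    p = len(x)
    norm2 = sum(xi * xi for xi in x)
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * xi for xi in x]

def risk_estimate(x, sigma2=1.0):
    """Unbiased risk estimate: p sigma^2 - (p - 2)^2 sigma^4 / |x|^2."""
    p = len(x)
    norm2 = sum(xi * xi for xi in x)
    return p * sigma2 - (p - 2) ** 2 * sigma2 ** 2 / norm2

def simulate(theta, reps=4000, seed=0):
    """Average losses of JS and of X, and the average risk estimate."""
    rng = random.Random(seed)
    loss_js = loss_mle = estimate = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]   # sigma = 1
        d = james_stein(x)
        loss_js += sum((di - t) ** 2 for di, t in zip(d, theta))
        loss_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        estimate += risk_estimate(x)
    return loss_js / reps, loss_mle / reps, estimate / reps

avg_js, avg_mle, avg_estimate = simulate([1.0] * 10)
```

The average James-Stein loss falls below that of $\mathbf{X}$, and the averaged risk estimate tracks the simulated James-Stein risk, as unbiasedness suggests.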

Example 7.3 Bayes risk of the James-Stein estimator. Under the model of
Example 7.1, the Bayes risk of $\delta^{JS}$ is

$r(\tau, \delta^{JS}) = \int_\Omega R(\theta, \delta^{JS}) \pi(\theta) \, d\theta$

$= \int_\Omega \int_{\mathcal{X}} \left[ p\sigma^2 - \dfrac{(p - 2)^2 \sigma^4}{|\mathbf{x}|^2} \right] f(\mathbf{x} \mid \theta) \pi(\theta) \, d\mathbf{x} \, d\theta$

$= \int_{\mathcal{X}} \left\{ \int_\Omega \left[ p\sigma^2 - \dfrac{(p - 2)^2 \sigma^4}{|\mathbf{x}|^2} \right] \pi(\theta \mid \mathbf{x}) \, d\theta \right\} m(\mathbf{x}) \, d\mathbf{x}$,

where we have used (7.6), and changed the order of integration. Since the integrand
is independent of $\theta$, the inner integral is trivially equal to 1, and

(7.7)    $r(\tau, \delta^{JS}) = p\sigma^2 - (p - 2)^2 \sigma^4 E \dfrac{1}{|\mathbf{X}|^2}$.

Here, the expected value is over the marginal distribution of $\mathbf{X}$ (in contrast to (7.6),
where the expectation is over the conditional distribution of $\mathbf{X} \mid \theta$).

Since, marginally, $E(1/|\mathbf{X}|^2) = 1/[(p - 2)(\sigma^2 + \tau^2)]$, we have

(7.8)    $r(\tau, \delta^{JS}) = p\sigma^2 - \dfrac{(p - 2)\sigma^4}{\sigma^2 + \tau^2}$

         $= \dfrac{p\sigma^2 \tau^2}{\sigma^2 + \tau^2} + \dfrac{2\sigma^4}{\sigma^2 + \tau^2}$

         $= r(\tau, \delta^{\tau}) + \dfrac{2\sigma^4}{\sigma^2 + \tau^2}$.

Here, the second term represents the increase in Bayes risk that arises from
estimating $\tau^2$.

It is remarkable that $\delta^{JS}$ has a reasonable Bayes risk for any value of $\tau^2$, although
the latter is unknown to the experimenter. This establishes a degree of Bayesian
robustness of the empirical Bayes estimator. Of course, the increase in risk is a
function of $\sigma^2$ and can be quite large if $\sigma^2$ is large. Perhaps a more interesting
comparison is obtained by looking at the relative increase in risk

$\dfrac{r(\tau, \delta^{JS}) - r(\tau, \delta^{\tau})}{r(\tau, \delta^{\tau})} = \dfrac{2}{p} \, \dfrac{\sigma^2}{\tau^2}$.

We see that the relative increase is an increasing function of the ratio $\sigma^2/\tau^2$ of the
sample variance to the prior variance and goes to 0 as $\sigma^2/\tau^2 \to 0$. Thus, the risk of the empirical Bayes
estimator approaches that of the Bayes estimator as the sampling information gets
infinitely better than the prior information.


Example 7.4 Bayesian robustness of the James-Stein estimator. To further
explore the robustness of the James-Stein estimator, consider what happens to the
Bayes risk if the prior used to calculate the Bayes estimator is different from the
prior used to evaluate the Bayes risk (a classic concern of robust Bayesians).

For the model in Example 7.1, suppose we specify a value of $\tau$, say $\tau_0$. The
Bayes estimator, $\delta^{\tau_0}$, is given by $\delta^{\tau_0}(x_i) = [\tau_0^2/(\tau_0^2 + \sigma^2)] x_i$. When evaluating the
Bayes risk, suppose we let the prior variance take on any value $\tau^2$, not necessarily
equal to $\tau_0^2$. Then, the Bayes risk of $\delta^{\tau_0}$ is (Problem 7.4)

(7.9)    $r(\tau, \delta^{\tau_0}) = p\sigma^2 \left( \dfrac{\tau_0^2}{\tau_0^2 + \sigma^2} \right)^2 + p\tau^2 \left( \dfrac{\sigma^2}{\tau_0^2 + \sigma^2} \right)^2$,

which is equal to the single-prior Bayes risk (7.1) when $\tau_0 = \tau$. However, as
$\tau^2 \to \infty$, $r(\tau, \delta^{\tau_0}) \to \infty$, whereas $r(\tau, \delta^{\tau}) \to p\sigma^2$.

In contrast, the Bayes risk of $\delta^{JS}$, given in (7.8), is valid for all $\tau$ with $r(\tau, \delta^{JS}) \to
p\sigma^2$ as $\tau^2 \to \infty$. Thus, the Bayes risk of $\delta^{JS}$ remains finite for any prior in the
class, demonstrating robustness.
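
The three Bayes risks compared in Examples 7.3 and 7.4 are simple enough to tabulate directly. The following sketch codes (7.1), (7.8), and (7.9); the function names and the particular values of $p$, $\sigma^2$, $\tau^2$, and $\tau_0^2$ are illustrative assumptions.

```python
def bayes_risk(p, sigma2, tau2):
    """(7.1): Bayes risk of the Bayes estimator when tau^2 is known."""
    return p * sigma2 * tau2 / (sigma2 + tau2)

def js_bayes_risk(p, sigma2, tau2):
    """(7.8): Bayes risk of the James-Stein estimator."""
    return bayes_risk(p, sigma2, tau2) + 2.0 * sigma2 ** 2 / (sigma2 + tau2)

def misspecified_bayes_risk(p, sigma2, tau2, tau0_2):
    """(7.9): Bayes risk under tau^2 of the Bayes rule built with tau_0^2."""
    w = tau0_2 / (tau0_2 + sigma2)
    return p * sigma2 * w ** 2 + p * tau2 * (1.0 - w) ** 2

p, sigma2, tau0_2 = 10, 1.0, 1.0
for tau2 in (0.5, 1.0, 10.0, 1000.0):
    print(tau2,
          round(bayes_risk(p, sigma2, tau2), 3),
          round(js_bayes_risk(p, sigma2, tau2), 3),
          round(misspecified_bayes_risk(p, sigma2, tau2, tau0_2), 3))
```

As $\tau^2$ grows with $\tau_0^2$ held fixed, the last column increases without bound while the James-Stein column stays below $p\sigma^2$.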


In constructing an empirical Bayes estimator in Example 7.1, the use of unbiased
estimation of the hyperparameters led to the James-Stein estimator. If, instead, we
had used maximum likelihood, the resulting empirical Bayes estimator would have
been (Problem 7.1)

(7.10)   $\delta^{+}(\mathbf{x}) = \left( 1 - \dfrac{p\sigma^2}{|\mathbf{x}|^2} \right)^{+} \mathbf{x}$,

where $(a)^{+} = \max\{0, a\}$. Such estimators are known as positive-part Stein
estimators.

A problem with the empirical Bayes estimator (7.3) is that when $|\mathbf{x}|^2$ is small
(less than $(p - 2)\sigma^2$), the estimator has the "wrong sign"; that is, the signs of the
components of $\delta^{JS}$ will be opposite those of the Bayes estimator $\delta^{\tau}$. This does
not happen with the estimator (7.10), and as a result, estimators like (7.10) tend to
have improved Bayes risk performance.

Estimators such as (7.3) and (7.10) are called shrinkage estimators, since they
tend to shrink the estimator $\mathbf{X}$ toward 0, the shrinkage target. Actually, of the two,
only (7.10) completely succeeds in this effort since the shrinkage factor
$1 - (p - 2)\sigma^2/|\mathbf{x}|^2$ may take on negative (and even very large negative) values. Nevertheless,
the terminology is used to cover also (7.3). The following theorem is due to Efron
and Morris (1973a).

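Theorem 7.5 Let $\mathbf{X} \sim N_p(\theta, \sigma^2 I)$ and $\Theta \sim N_p(0, \tau^2 I)$, with loss function
$L(\theta, \delta) = |\theta - \delta|^2$. If $\delta(\mathbf{x})$ is an estimator of the form

$\delta(\mathbf{x}) = [1 - B(\mathbf{x})]\mathbf{x}$

and if

$\delta^{+}(\mathbf{x}) = [1 - B(\mathbf{x})]^{+} \mathbf{x}$,

then $r(\tau, \delta) \ge r(\tau, \delta^{+})$, with strict inequality if $P_\theta(\delta(\mathbf{X}) \ne \delta^{+}(\mathbf{X})) > 0$.

Proof. For any estimator $\delta(\mathbf{x})$, the posterior expected loss is given by

(7.11)   $E[L(\Theta, \delta(\mathbf{x})) \mid \mathbf{x}] = \int_\Omega \sum_{i=1}^{p} (\theta_i - \delta_i(\mathbf{x}))^2 \pi(\theta \mid \mathbf{x}) \, d\theta$

         $= \int_\Omega \sum_{i=1}^{p} \left[ (\theta_i - E(\theta_i \mid \mathbf{x}))^2 + (E(\theta_i \mid \mathbf{x}) - \delta_i(\mathbf{x}))^2 \right] \pi(\theta \mid \mathbf{x}) \, d\theta$,

where we have added $\pm E(\theta_i \mid \mathbf{x})$ and expanded the square, noting that the cross-term
is zero. Equation (7.11) can then be written as

(7.12)   $E[L(\Theta, \delta(\mathbf{x})) \mid \mathbf{x}] = \sum_{i=1}^{p} \text{var}(\theta_i \mid \mathbf{x}) + \sum_{i=1}^{p} [E(\theta_i \mid \mathbf{x}) - \delta_i(\mathbf{x})]^2$.

As the first term in (7.12) does not depend on the particular estimator, the difference
in posterior expected loss between $\delta$ and $\delta^{+}$ is

(7.13)   $E[L(\Theta, \delta(\mathbf{x})) \mid \mathbf{x}] - E[L(\Theta, \delta^{+}(\mathbf{x})) \mid \mathbf{x}]$

         $= \sum_{i=1}^{p} \left\{ [E(\theta_i \mid \mathbf{x}) - \delta_i(\mathbf{x})]^2 - [E(\theta_i \mid \mathbf{x})]^2 \right\} I(|B(\mathbf{x})| > 1)$

since the estimators are identical when $|B(\mathbf{x})| \le 1$. However, since $E(\theta_i \mid \mathbf{x}) =
[\tau^2/(\sigma^2 + \tau^2)] x_i$, it follows that when $|B(\mathbf{x})| > 1$,

$\left[ \dfrac{\tau^2}{\sigma^2 + \tau^2} x_i - \delta_i(\mathbf{x}) \right]^2 > \left[ \dfrac{\tau^2}{\sigma^2 + \tau^2} x_i \right]^2$.

Thus, (7.13) is positive for all $\mathbf{x}$, and the result follows by taking expectations. □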
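
The comparison in the proof of Theorem 7.5 can be illustrated for a single data vector. The sketch below evaluates the difference in posterior expected loss between $[1 - B(\mathbf{x})]\mathbf{x}$ and its positive-part version, using the James-Stein choice $B(\mathbf{x}) = (p - 2)\sigma^2/|\mathbf{x}|^2$; the function names and the two data vectors are hypothetical.

```python
def js_shrinkage(x, sigma2):
    """B(x) = (p - 2) sigma^2 / |x|^2, the James-Stein shrinkage amount."""
    return (len(x) - 2) * sigma2 / sum(xi * xi for xi in x)

def posterior_loss_gap(x, sigma2, tau2):
    """Posterior expected loss of [1-B]x minus that of [1-B]^+ x.

    By (7.12) the posterior variances cancel, so only the squared distances
    to the posterior mean tau^2/(sigma^2+tau^2) * x_i remain.
    """
    w = tau2 / (sigma2 + tau2)
    plain = 1.0 - js_shrinkage(x, sigma2)
    positive = max(0.0, plain)
    return sum((w * xi - plain * xi) ** 2 - (w * xi - positive * xi) ** 2
               for xi in x)

near_origin = [0.5, -0.3, 0.2, 0.1, 0.4]   # |x|^2 small, so B(x) > 1
far_away = [3.0, -2.0, 4.0, 1.0, 5.0]      # |x|^2 large, so B(x) < 1
gap_near = posterior_loss_gap(near_origin, 1.0, 2.0)
gap_far = posterior_loss_gap(far_away, 1.0, 2.0)
```

The gap is strictly positive when $B(\mathbf{x}) > 1$ and exactly zero otherwise, which is the pointwise statement that the proof integrates over $\mathbf{x}$.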

In view of results like Theorem 7.5 and other risk results in Chapter 5 (see
Theorem 5.5.4), the positive-part Stein estimator

(7.14)   $\delta^{+}(\mathbf{x}) = \left( 1 - \dfrac{(p - 2)\sigma^2}{|\mathbf{x}|^2} \right)^{+} \mathbf{x}$

is preferred to the ordinary James-Stein estimator (7.3). Moreover, Theorem 7.5
generalizes to the entire exponential family (Problem 7.8). It also supports the use
of maximum likelihood estimation in empirical Bayes constructions.

The good Bayes risk performance of empirical Bayes estimators is not restricted
to the normal case, nor to squared error loss. We next look at Bayes and empirical
Bayes estimation in the Poisson case.

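Example 7.6 Poisson Bayes and empirical Bayes estimation. Recall the Poisson
model of Example 6.6:

(7.15)   $X_i \sim \text{Poisson}(\lambda_i)$,  $i = 1, \ldots, p$, independent,

         $\lambda_i \sim \text{Gamma}(a, b)$.

For estimation of $\lambda_i$ under the loss

(7.16)   $L_k(\lambda, \delta) = \sum_{i=1}^{p} \dfrac{1}{\lambda_i^k} (\lambda_i - \delta_i)^2$,

the Bayes estimator (see Example 1.3) is

(7.17)   $\delta_i^k(\mathbf{x}) = \dfrac{b}{b + 1} (x_i + a - k)$.

The posterior expected loss of $\delta_i^k(\mathbf{x}) = \delta_i^k(x_i)$ is

(7.18)   $E\left[ \dfrac{1}{\lambda_i^k} [\lambda_i - \delta_i^k(x_i)]^2 \,\Big|\, x_i \right] = \dfrac{1}{\Gamma(a + x_i) \left( \frac{b}{b+1} \right)^{a + x_i}} \int_0^\infty (\lambda_i - \delta_i^k)^2 \lambda_i^{a + x_i - k - 1} e^{-\lambda_i (b+1)/b} \, d\lambda_i$,

since the posterior distribution of $\lambda_i \mid x_i$ is Gamma$(a + x_i, \frac{b}{b+1})$. Evaluating the
integral in (7.18) gives

(7.19)   $E\left[ \dfrac{1}{\lambda_i^k} [\lambda_i - \delta_i^k(x_i)]^2 \,\Big|\, x_i \right] = \dfrac{\Gamma(a + x_i - k)}{\Gamma(a + x_i)} \left( \dfrac{b}{b + 1} \right)^{2 - k} (a + x_i - k)$.

To evaluate the Bayes risk, $r(k, \delta^k)$, we next sum (7.19) with respect to the marginal
distribution of $X_i$, which is Negative Binomial$(a, \frac{1}{b+1})$. For $k = 0$ and $k = 1$, we
have

$r(0, \delta^0) = p \dfrac{ab^2}{b + 1}$  and  $r(1, \delta^1) = p \dfrac{b}{b + 1}$.

See Problems 7.10 and 7.11 for details.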
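
The two closed-form Bayes risks above can be checked by summing the posterior expected loss (7.19) against the negative binomial marginal of (6.22) numerically. The sketch below is illustrative: the function names and the truncation point `x_max` are assumptions, and it requires $a > k$ so that every gamma function argument is positive.

```python
import math

def neg_binomial_log_pmf(x, a, b):
    """Log of the marginal pmf of X_i: negative binomial, as in (6.22)."""
    return (math.lgamma(x + a) - math.lgamma(x + 1) - math.lgamma(a)
            + x * math.log(b / (b + 1.0)) - a * math.log(b + 1.0))

def posterior_expected_loss(x, a, b, k):
    """(7.19) for one coordinate; assumes a > k."""
    s = b / (b + 1.0)
    gamma_ratio = math.exp(math.lgamma(a + x - k) - math.lgamma(a + x))
    return gamma_ratio * s ** (2 - k) * (a + x - k)

def bayes_risk(p, a, b, k, x_max=5000):
    """p times the marginal average of (7.19), truncated at x_max."""
    per_coordinate = sum(
        math.exp(neg_binomial_log_pmf(x, a, b)) * posterior_expected_loss(x, a, b, k)
        for x in range(x_max + 1))
    return p * per_coordinate

risk_k1 = bayes_risk(5, 20.0, 0.5, 1)   # compare with p * b / (b + 1)
risk_k0 = bayes_risk(5, 4.0, 2.5, 0)    # compare with p * a * b^2 / (b + 1)
```

For $k = 1$ the summand in (7.19) does not depend on $x_i$, which is why the risk reduces to $pb/(b + 1)$ with no dependence on $a$.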

For the model (7.15) with loss function $L_k(\lambda, \delta)$ of (7.16), an empirical Bayes
estimator can be derived (similar to (6.6.24); see Example 6.6) as

(7.20)   $\delta_i^{EB}(\mathbf{x}) = \dfrac{\bar x}{\bar x + a} (x_i + a - k)$.

We shall now consider the risk of the estimator $\delta^{EB}$. For the loss function (7.16),
we can actually evaluate the risk of a more general estimator than $\delta^{EB}$. The
coordinatewise posterior expected loss of an estimator of the form $\delta_i^{\varphi} = \varphi(\bar x)(x_i + a - k)$
is

(7.21)   $E\left[ \dfrac{1}{\lambda_i^k} [\lambda_i - \delta_i^{\varphi}]^2 \,\Big|\, \mathbf{x} \right] = \dfrac{1}{\Gamma(a + x_i) \left( \frac{b}{b+1} \right)^{a + x_i}} \int_0^\infty [\lambda_i - \delta_i^{\varphi}(\mathbf{x})]^2 \lambda_i^{a + x_i - k - 1} e^{-\lambda_i (b+1)/b} \, d\lambda_i$

         $= \dfrac{\Gamma(a + x_i - k)}{\Gamma(a + x_i)} \left( \dfrac{b}{b + 1} \right)^{-k} E[(\lambda_i - \delta_i^{\varphi}(\mathbf{x}))^2 \mid \mathbf{x}]$,

where the expectation is over the random variable $\lambda_i$ with distribution
Gamma$(a + x_i - k, \frac{b}{b+1})$. Using the same technique as in the proof of Theorem 7.5 [see (7.12)],
we add $\pm \delta_i^k(x_i) = \frac{b}{b+1}(a + x_i - k)$ in (7.21) to get

(7.22)   $E\left[ \dfrac{1}{\lambda_i^k} [\lambda_i - \delta_i^{\varphi}]^2 \,\Big|\, \mathbf{x} \right] = \dfrac{\Gamma(a + x_i - k)}{\Gamma(a + x_i)} \left( \dfrac{b}{b + 1} \right)^{2 - k} (a + x_i - k)$

         $+ \dfrac{\Gamma(a + x_i - k)}{\Gamma(a + x_i)} \left( \dfrac{b}{b + 1} \right)^{-k} \left( \dfrac{b}{b + 1} - \varphi(\bar x) \right)^2 (a + x_i - k)^2$.

The first term in (7.22) is the posterior expected loss of the Bayes estimator, and
the second term reflects the penalty for estimating $b$. Evaluation of the Bayes
risk, which involves summing over $x_i$, is somewhat involved (see Problem 7.11).
Instead, Table 7.1 provides a few numerical comparisons. Specifically, it shows
the Bayes risks for the Bayes ($\delta^k$), empirical Bayes ($\delta^{EB}$), and unbiased estimators
($\mathbf{X}$) of Example 6.6, based on observing $p$ independent Poisson variables, for the
loss function (7.16) with $k = 1$. The gamma parameters are chosen so that the
prior mean equals 10 and the prior variances are 5 ($a = 20$, $b = .5$), 10 ($a = 10$,
$b = 1$), and 25 ($a = 4$, $b = 2.5$). It is seen that the empirical Bayes estimator attains
a reasonable Bayes risk reduction over that of $\mathbf{X}$, and in some cases, comes quite
close to the optimum.


As a final example of Bayes risk performance, we turn now to the analysis of 
variance. Here, we shall consider only the one-way layout (Examples 4.1, 4.6 and 
4.9) in detail. Other situations and generalizations are illustrated in the problems 
(Problems 7.17 and 7.18). 

Example 7.7 Empirical Bayes analysis of variance. In the one-way layout
(considered earlier in Example 3.4.9 from the point of view of equivariance), we have

(7.23)   $X_{ij} \sim N(\xi_i, \sigma^2)$,  $j = 1, \ldots, n_i$,  $i = 1, \ldots, s$,

         $\xi_i = \mu + \alpha_i$,  $i = 1, \ldots, s$,

where we assume that $\sum \alpha_i = 0$ to ensure the identifiability of parameters. With this
restriction, the parameterization in terms of $\mu$ and $\alpha_i$ is equivalent to that in terms
of $\xi_i$, with the latter parameterization (the so-called cell means model; see Searle
1987) being computationally more friendly. As interest often lies in estimation of,
and testing hypotheses about, the differences of the $\alpha_i$'s, which are equivalent to
the differences of the $\xi_i$'s, we will use the $\xi_i$ version of the model. We will also
specialize to the balanced case where all $n_i$'s are equal. The more general case
requires some (often much) extra effort. (See Problems 7.16 and 7.19.)

Table 7.1. Comparisons of Some Bayes Risks for Model (7.15)

                                   $p = 5$
   Prior var.    $\delta^k$ of (7.17)    $\delta^{EB}$ of (7.20)    $\mathbf{X}$
       5              1.67                   2.28                5.00
      10              2.50                   2.99                5.00
      25              3.57                   3.84                5.00

                                   $p = 20$
   Prior var.    $\delta^k$ of (7.17)    $\delta^{EB}$ of (7.20)    $\mathbf{X}$
       5              6.67                   7.31               20.00
      10             10.00                  10.51               20.00
      25             14.29                  14.52               20.00

As an illustration, consider an experiment to assess the effect of linseed oil meal 
on the digestibility of food by steers. The measurements are a digestibility coeffi¬ 
cient, and there are five treatments, representing different amounts of linseed oil 
meal added to the feed (approximately 1, 2, 3, 4, and 5 kg/animal/day; see Hsu 
1982 for more details.) The variable Xjj of (7.23) is the /th digestibility measure¬ 
ment in the /th treatment group, where if, is the true coefficient of digestibility of 
that group. Perhaps the most common hypothesis about the §, ’s is 

(7.24) #o : §t = ?2 = • • • = Hs = M unknown. 

This specifies that the means are equal and, hence, the treatment groups are equiv¬ 
alent in that they each result in the same (unknown) mean level of digestibility. 
This hypothesis can be thought of as specifying a submodel where all of the £’s 
are equal, which suggests expanding (7.23) into the hierarchical model 

(7.25) X,j |f, ~ N(fj , a 2 ), j = 1,..., n, / = 1, ..., s, independent, 

§, |/x ~ N(n, r 2 ), i = 1,..., s, independent. 

The model (7.25) is obtained from (7.24) by allowing some variation around the 
prior mean, /x, in the form of a normal distribution. 

In analogy to (4.2.4), the Bayes estimator of f is 

B a 2 nr 2 

(7.26) S B (xt) = — - -fi + —-- -Xi. 

a- + nr- a- + nr - 
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Calculation of an empirical Bayes estimator is straightforward. Since the marginal 
distribution of Xj is 


X, ~ N 



i = 1,..., s. 


the MLE of fi is x = , *ij /ns and the resulting empirical Bayes estimator is 


(7.27) 


'i 

sEB 


&eb = 


fit 


a- + ffr z 


rX + 


a- + n x- 




Note that $\delta^{EB}$ is a linear combination of $\bar X_i$, the UMVU estimator under the full
model, and $\bar X$, the UMVU estimator under the submodel that specifies $\xi_1 = \cdots = \xi_s$.

If we drop the assumption that $\tau^2$ is known, we can estimate $(\sigma^2/n + \tau^2)^{-1}$ by the
unbiased estimator $(s - 3)/\sum(\bar x_i - \bar x)^2$ and obtain the empirical Bayes estimator

(7.28) $\delta_i^{L} = \bar x + \left(1 - \dfrac{(s - 3)\,\sigma^2/n}{\sum(\bar x_i - \bar x)^2}\right)(\bar x_i - \bar x),$


which was first derived by Lindley (1962) and examined in detail by Efron and
Morris (1972a, 1972b, 1973a, 1973b).
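
As a rough illustration of how the shrinkage in (7.28) is computed, the following Python sketch takes the $s$ group means, the known variance $\sigma^2$, and the common group size $n$, and shrinks each group mean toward the grand mean. The function name and the example numbers are hypothetical, and the sketch assumes the shrinkage factor $(s-3)(\sigma^2/n)/\sum(\bar x_i - \bar x)^2$ implied by the marginal variance $\sigma^2/n + \tau^2$ of the group means.

```python
def lindley_estimator(group_means, sigma2, n):
    """Shrink each group mean toward the grand mean, in the spirit of (7.28).

    group_means : list of the s group means (x-bar_i)
    sigma2      : known within-group variance sigma^2 (assumed known here)
    n           : common number of observations per group (balanced case)
    """
    s = len(group_means)
    if s <= 3:
        raise ValueError("the estimator needs s > 3 groups")
    grand_mean = sum(group_means) / s
    ss = sum((m - grand_mean) ** 2 for m in group_means)
    if ss == 0.0:
        raise ValueError("all group means are equal; shrinkage factor undefined")
    # Estimated shrinkage factor: (s - 3) * (sigma^2 / n) / sum (x-bar_i - x-bar)^2
    shrink = (s - 3) * (sigma2 / n) / ss
    return [grand_mean + (1.0 - shrink) * (m - grand_mean) for m in group_means]


if __name__ == "__main__":
    # Hypothetical group means for five treatments, four animals per treatment.
    print(lindley_estimator([1.0, 2.0, 3.0, 4.0, 5.0], sigma2=2.0, n=4))
```

Each estimate moves toward the grand mean by the same estimated fraction, so the ordering of the groups is preserved whenever that fraction is below 1.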


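Calculation of the Bayes risk of $\delta^L$ proceeds as in Example 7.3, and leads to

(7.29) $r(\xi, \delta^L) = s\,\dfrac{\sigma^2}{n} - (s - 3)^2 \left(\dfrac{\sigma^2}{n}\right)^2 E\left[\dfrac{1}{\sum_{i=1}^s (\bar X_i - \bar X)^2}\right] = r(\xi, \delta^B) + \dfrac{3(\sigma^2/n)^2}{\sigma^2/n + \tau^2},$

where $\sum_{i=1}^s (\bar X_i - \bar X)^2 \sim (\sigma^2/n + \tau^2)\chi^2_{s-1}$ and $r(\xi, \delta^B)$ is the risk of the Bayes
estimator (7.26). See Problem 7.14 for details.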

If we compare (7.29) to (7.8), we see that the Bayes risk performance of $\delta^L$,
where we have estimated the value of $\mu$, is similar to that of $\delta^{JS}$, where we assume
that the value of $\mu$ is known. The difference is that $\delta^L$ pays an extra penalty for
estimating the point that is the shrinkage target. For $\delta^{JS}$, the target is assumed
known and taken to be 0, while $\delta^L$ estimates it by $\bar X$. The penalty for this is that
the factor in the term added to the Bayes risk is increased from 2 in (7.8) to 3 here,
where $k = 1$. In general, if we shrink to a $k$-dimensional subspace, this factor is $2 + k$.


More general submodels can also be incorporated in empirical Bayes analyses, 
and in many cases, the resulting estimators retain good Bayes risk performance. 

Example 7.8 Analysis of variance with regression submodel. Another common 
hypothesis (or submodel) in the analysis of variance is that of a linear trend in the 
means, which was considered earlier in Example 3.4.7 and can be written as the 
null hypothesis 

$H_0 : \xi_i = \alpha + \beta t_i$, $i = 1, \ldots, s$, $\alpha$ and $\beta$ unknown, $t_i$ known.

For the situation of Example 7.7, this hypothesis would assert that the effect of the 
quantity of linseed oil meal on digestibility is linear. (We know that as the quantity 
of linseed oil meal increases, the coefficient of digestibility decreases. But we do
not know if this relationship is linear.) In analogy with (7.25), we can translate the 
hypothesis into the hierarchical model 

(7.30) $X_{ij} \mid \xi_i \sim N(\xi_i, \sigma^2)$, $j = 1, \ldots, n$, $i = 1, \ldots, s$,

$\xi_i \mid \alpha, \beta \sim N(\alpha + \beta t_i, \tau^2)$, $i = 1, \ldots, s$.


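Again, the hypothesis models the prior mean of the $\xi_i$'s, and we allow variation
around this prior mean in the form of a normal distribution. Using squared error
loss, the Bayes estimator of $\xi_i$ is

(7.31) $\delta_i^{B} = \dfrac{\sigma^2}{\sigma^2 + n\tau^2}\,(\alpha + \beta t_i) + \dfrac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar X_i.$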


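For an empirical Bayes estimator, we calculate the marginal distribution of $\bar X_i$,

$\bar X_i \sim N(\alpha + \beta t_i, \sigma^2/n + \tau^2)$, $i = 1, \ldots, s$,

and estimate $\alpha$ and $\beta$ by

$\hat\alpha = \bar X - \hat\beta\,\bar t$, $\hat\beta = \dfrac{\sum(\bar X_i - \bar X)(t_i - \bar t)}{\sum(t_i - \bar t)^2},$

the UMVU estimators of $\alpha$ and $\beta$ (Section 3.4). This yields the empirical Bayes
estimator

(7.32) $\delta_i^{EB_1} = \dfrac{\sigma^2}{\sigma^2 + n\tau^2}\,(\hat\alpha + \hat\beta t_i) + \dfrac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar X_i.$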

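If $\tau^2$ is unknown we can, in analogy to Example 7.7, use the fact that, marginally,
$E\left[1/\sum(\bar X_i - \hat\alpha - \hat\beta t_i)^2\right] = 1/[(s - 4)(\sigma^2/n + \tau^2)]$ to construct the estimator

(7.33) $\delta_i^{EB_2} = \hat\alpha + \hat\beta t_i + \left(1 - \dfrac{(s - 4)\,\sigma^2/n}{\sum(\bar X_i - \hat\alpha - \hat\beta t_i)^2}\right)(\bar X_i - \hat\alpha - \hat\beta t_i)$

with Bayes risk

(7.34) $r(\xi, \delta^{EB_2}) = s\,\dfrac{\sigma^2}{n} - (s - 4)^2 \left(\dfrac{\sigma^2}{n}\right)^2 E\left[\dfrac{1}{\sum(\bar X_i - \hat\alpha - \hat\beta t_i)^2}\right] = r(\xi, \delta^B) + \dfrac{4(\sigma^2/n)^2}{\sigma^2/n + \tau^2},$

where $r(\xi, \delta^B)$ is the risk of the Bayes estimator $\delta^B$ of (7.31). See Problem 7.14
for details.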

Notice that here we shrunk the estimator toward a two-dimensional submodel,
and the factor in the second term of the Bayes risk is 4 $(= 2 + k)$. We also note that
for $\delta^{EB_2}$, as well as $\delta^{JS}$ and $\delta^L$, the Bayes risk approaches that of $\delta^\pi$ as $n \to \infty$. ||
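
The computation in Example 7.8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the sample inputs are hypothetical, $\sigma^2$ is treated as known, the design is balanced, and the shrinkage factor is taken to be $(s-4)(\sigma^2/n)$ divided by the residual sum of squares of the group means about the fitted line.

```python
def regression_eb_estimator(group_means, t, sigma2, n):
    """Shrink group means toward a fitted straight line alpha-hat + beta-hat * t_i.

    group_means : list of the s group means (x-bar_i)
    t           : list of the s known covariate values t_i
    sigma2      : known within-group variance sigma^2 (assumed known here)
    n           : common number of observations per group
    """
    s = len(group_means)
    if s <= 4:
        raise ValueError("the estimator needs s > 4 groups")
    xbar = sum(group_means) / s
    tbar = sum(t) / s
    stt = sum((ti - tbar) ** 2 for ti in t)
    # Least squares slope and intercept of the group means on t.
    beta = sum((x - xbar) * (ti - tbar) for x, ti in zip(group_means, t)) / stt
    alpha = xbar - beta * tbar
    fitted = [alpha + beta * ti for ti in t]
    rss = sum((x - f) ** 2 for x, f in zip(group_means, fitted))
    if rss == 0.0:
        raise ValueError("group means lie exactly on a line; factor undefined")
    shrink = (s - 4) * (sigma2 / n) / rss
    return [f + (1.0 - shrink) * (x - f) for x, f in zip(group_means, fitted)]


if __name__ == "__main__":
    # Hypothetical data: six dose levels, two observations per level.
    means = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
    doses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(regression_eb_estimator(means, doses, sigma2=1.0, n=2))
```

Because least squares residuals sum to zero when an intercept is fitted, the estimates have the same total as the original group means; only their spread about the fitted line changes.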


In both Examples 7.7 and 7.8, empirical Bayes estimators provide a means for
attaining reasonable Bayes risk performance if $\sigma^2/\tau^2$ is not too large, yet do not
require full specification of a prior distribution. An obvious limitation of these
results, however, is the dimension of the submodel. The ordinary James-Stein
estimator shrinks toward the point 0 [see Equation (7.3)], or any specified point
(Problem 7.6), and hence toward a submodel (subspace) of dimension zero. In the
analysis of variance, Example 7.7, the subspace of the submodel has dimension
1, $\{(\xi_1, \ldots, \xi_s) : \xi_i = \mu,\ i = 1, \ldots, s\}$, and in Example 7.8, it has dimension
2, $\{(\xi_1, \ldots, \xi_s) : \xi_i = \alpha + \beta t_i,\ i = 1, \ldots, s\}$. In general, the empirical Bayes
strategies developed here will only work if the dimension of the submodel, $r$, is
smaller than that of the full model, $s$, by more than two; that is, $s - r > 2$. This is a technical
requirement, as the marginal distribution of interest is $\chi^2_{s-r}$, and estimation is
problematic if $s - r \le 2$. The reason for this difficulty is the need to calculate the
expectation $E(1/\chi^2_{s-r})$, which is infinite if $s - r \le 2$. (See Problem 7.6; also see
Problem 6.12 for an attempt at empirical Bayes if $s - r \le 2$.)

In light of Theorem 7.5, we can improve the empirical Bayes estimators of 
Examples 7.7 and 7.8 by using their positive-part version. Moreover, Problem 
7.8 shows that such an improvement will hold throughout the entire exponential 
family. Thus, the strategy of taking a positive part should always be employed in 
these cases of empirical Bayes estimation. 
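
A positive-part version simply truncates the estimated shrinkage multiplier at zero, so that the estimate never overshoots the shrinkage target. The following Python sketch applies this to the estimator of Example 7.7; the function name and inputs are hypothetical, and the shrinkage factor $(s-3)(\sigma^2/n)/\sum(\bar x_i - \bar x)^2$ is an assumption based on the marginal variance of the group means.

```python
def positive_part_lindley(group_means, sigma2, n):
    """Positive-part shrinkage of group means toward their grand mean."""
    s = len(group_means)
    if s <= 3:
        raise ValueError("the estimator needs s > 3 groups")
    grand_mean = sum(group_means) / s
    ss = sum((m - grand_mean) ** 2 for m in group_means)
    if ss == 0.0:
        # All means coincide with the target already.
        return [grand_mean] * s
    shrink = (s - 3) * (sigma2 / n) / ss
    # Positive part: never let the multiplier (1 - shrink) go negative.
    multiplier = max(0.0, 1.0 - shrink)
    return [grand_mean + multiplier * (m - grand_mean) for m in group_means]
```

When the group means are very close together relative to $\sigma^2/n$, the untruncated multiplier would be negative and would push each estimate to the opposite side of the grand mean; the truncated version returns the grand mean itself.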

Finally, we note that Examples 7.7 and 7.8 can be greatly generalized. One
can handle unequal $n_i$, unequal variances, full covariance matrices, general linear
submodels, and more. In some cases, the algebra can become somewhat overwhelming,
and details about performance of the estimators may become obscured.
We examine a number of these cases in Problems 7.16-7.18.

8 Problems 
Section 1 

1.1 Verify the expressions for $\pi(\lambda \mid \bar x)$ and $\delta^k(\bar x)$ in Example 1.3.

1.2 Give examples of pairs of values $(a, b)$ for which the beta density $B(a, b)$ is (a)
decreasing, (b) increasing, (c) increasing for $p < p_0$ and decreasing for $p > p_0$, and
(d) decreasing for $p < p_0$ and increasing for $p > p_0$.

1.3 In Example 1.5, if $p$ has the improper prior density $\frac{1}{p(1-p)}$, show that the posterior
density of $p$ given $x$ is proper, provided $0 < x < n$.

1.4 In Example 1.5, find the Jeffreys prior for $p$ and the associated Bayes estimator $\delta_\Lambda$.

1.5 For the estimator $\delta_\Lambda$ of Problem 1.4,

(a) calculate the bias and maximum bias;

(b) calculate the expected squared error and compare it with that of the UMVU estimator.

1.6 In Example 1.5, find the Bayes estimator $\delta$ of $p(1 - p)$ when $p$ has the prior $B(a, b)$.

1.7 For the situation of Example 1.5, the UMVU estimator of $p(1 - p)$ is $\delta' = [x(n - x)]/[n(n - 1)]$
(see Example 2.3.1 and Problem 2.3.1).

(a) Compare the estimator $\delta$ of Problem 1.6 with the UMVU estimator $\delta'$.

(b) Compare the expected squared error of the estimator of $p(1 - p)$ for the Jeffreys
prior in Example 1.5 with that of $\delta'$.

1.8 In analogy with Problem 1.2, determine the possible shapes of the gamma density
$\Gamma(g, 1/\alpha)$, $\alpha, g > 0$.

1.9 Let $X_1, \ldots, X_n$ be iid according to the Poisson distribution $P(\lambda)$ and let $\lambda$ have a
gamma distribution $\Gamma(g, \alpha)$.

(a) For squared error loss, show that the Bayes estimator $\delta_{\alpha g}$ of $\lambda$ has a representation
analogous to (1.1.13).

(b) What happens to $\delta_{\alpha g}$ as (i) $n \to \infty$, (ii) $\alpha \to \infty$, $g \to 0$, or both?

1.10 For the situation of the preceding problem, solve the two parts corresponding to
Problem 1.5(a) and (b).

1.11 In Problem 1.9, if $\lambda$ has the improper prior density $d\lambda/\lambda$ (corresponding to $\alpha = g = 0$),
under what circumstances is the posterior distribution proper?

1.12 Solve the problems analogous to Problems 1.9 and 1.10 when the observations
consist of a single random variable $X$ having a negative binomial distribution $Nb(p, m)$,
$p$ has the beta prior $B(a, b)$, and the estimand is (a) $p$ and (b) $1/p$.


Section 2 

2.1 Referring to Example 1.5, suppose that $X$ has the binomial distribution $b(p, n)$ and
the family of prior distributions for $p$ is the family of beta distributions $B(a, b)$.

(a) Show that the marginal distribution of $X$ is the beta-binomial distribution with mass
function

$\dbinom{n}{x} \dfrac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \dfrac{\Gamma(x + a)\Gamma(n - x + b)}{\Gamma(n + a + b)}.$

(b) Show that the mean and variance of the beta-binomial are given by

$EX = \dfrac{na}{a + b}$ and $\operatorname{var} X = n\,\dfrac{ab}{(a + b)^2}\,\dfrac{a + b + n}{a + b + 1}.$

[Hint: For part (b), the identities $EX = E[E(X \mid p)]$ and $\operatorname{var} X = \operatorname{var}[E(X \mid p)] +
E[\operatorname{var}(X \mid p)]$ are helpful.]
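
A quick numerical check of the beta-binomial moments can be made by summing the mass function directly. The Python sketch below is illustrative only: the function names and the parameter values $n = 10$, $a = 2$, $b = 3$ are hypothetical choices, and the closed forms it compares against are the standard beta-binomial mean $na/(a+b)$ and variance $nab(a+b+n)/[(a+b)^2(a+b+1)]$.

```python
import math


def beta_binomial_pmf(x, n, a, b):
    """Beta-binomial mass function, computed on the log scale for stability."""
    log_p = (math.log(math.comb(n, x))
             + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + math.lgamma(x + a) + math.lgamma(n - x + b)
             - math.lgamma(n + a + b))
    return math.exp(log_p)


def beta_binomial_moments(n, a, b):
    """Return (total probability, mean, variance) by direct summation."""
    probs = [beta_binomial_pmf(x, n, a, b) for x in range(n + 1)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var


if __name__ == "__main__":
    n, a, b = 10, 2.0, 3.0
    total, mean, var = beta_binomial_moments(n, a, b)
    closed_mean = n * a / (a + b)
    closed_var = n * a * b * (a + b + n) / ((a + b) ** 2 * (a + b + 1))
    print(total, mean, closed_mean, var, closed_var)
```

The same variance follows from the hint: for these values $\operatorname{var}[E(X \mid p)] = 4$ and $E[\operatorname{var}(X \mid p)] = 2$, which add to 6.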

2.2 For the situation of Example 2.1, Lindley and Phillips (1976) give a detailed account
of the effect of stopping rules, which we can illustrate as follows. Let $X$ be the number
of successes in $n$ Bernoulli trials with success probability $p$.

(a) Suppose that the number of Bernoulli trials performed is a prespecified number $n$,
so that we have the binomial sampling model, $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$,
$x = 0, 1, \ldots, n$. Calculate the Bayes risk of the Bayes estimator (1.1.12) and the
UMVU estimator of $p$.

(b) Suppose that the number of Bernoulli trials performed is a random variable $N$. The
value $N = n$ was obtained when a prespecified number, $x$, of successes was observed
so that we have the negative binomial sample model, $P(N = n) = \binom{n - 1}{x - 1} p^x (1 - p)^{n - x}$,
$n = x, x + 1, \ldots$. Calculate the Bayes risk of the Bayes estimator and the UMVU
estimator of $p$.

(c) Calculate the mean squared errors of all three estimators under each model. If it is
unknown which sampling mechanism generated the data, which estimator do you
prefer overall?


2.3 Show that the estimator (2.2.4) tends in probability (a) to $\theta$ as $n \to \infty$, (b) to $\mu$ as
$b \to 0$, and (c) to $\theta$ as $b \to \infty$.
2.4 Bickel and Mallows (1988) further investigate the relationship between unbiasedness
and Bayes, specifying conditions under which these properties cannot hold simultaneously.
In addition, they show that if a prior distribution is improper, then a posterior mean
can be unbiased. Let $X \sim \frac{1}{\theta} f(x/\theta)$, $x > 0$, where $\int_0^\infty t f(t)\,dt = 1$, and let $\pi(\theta) = \frac{1}{\theta^2}\,d\theta$,
$\theta > 0$.

(a) Show that $E(X \mid \theta) = \theta$, so $X$ is unbiased.

(b) Show that $\pi(\theta \mid x) = \frac{x^2}{\theta^3} f(x/\theta)$ is a proper density.

(c) Show that $E(\theta \mid x) = x$, and hence the posterior mean, is unbiased.

2.5 DasGupta (1994) presents an identity relating the Bayes risk to bias, which illustrates
that a small bias can help achieve a small Bayes risk. Let $X \sim f(x \mid \theta)$ and $\theta \sim \pi(\theta)$.
The Bayes estimator under squared error loss is $\delta^\pi = E(\theta \mid x)$. Show that the Bayes risk
of $\delta^\pi$ can be written

$r(\pi, \delta^\pi) = \int_\Theta \int_{\mathcal{X}} [\theta - \delta^\pi(x)]^2 f(x \mid \theta)\pi(\theta)\,dx\,d\theta = -\int_\Theta \theta\, b(\theta)\pi(\theta)\,d\theta,$

where $b(\theta) = E[\delta^\pi(X) \mid \theta] - \theta$ is the bias of $\delta^\pi$.

2.6 Verify the estimator (2.2.10). 

2.7 In Example 2.6, verify that the posterior distribution of $\tau$ is $\Gamma(r + g - 1/2, 1/(\alpha + z))$.

2.8 In Example 2.6 with $\alpha = g = 0$, show that the posterior distribution given the $X$'s of
$\sqrt{n}(\theta - \bar X)/\sqrt{Z/(n - 1)}$ is Student's $t$-distribution with $n - 1$ degrees of freedom.

2.9 In Example 2.6, show that the posterior distribution of $\theta$ is symmetric about $\bar x$ when
the joint prior of $\theta$ and $\sigma$ is of the form $h(\sigma)\,d\sigma\,d\theta$, where $h$ is an arbitrary probability
density on $(0, \infty)$.

2.10 Rukhin (1978) investigates the situation when the Bayes estimator is the same for
every loss function in a certain set of loss functions, calling such estimators universal
Bayes estimators. For the case of Example 2.6, using the prior of the form of Problem
2.9, show that $\bar X$ is the Bayes estimator under every even loss function.

2.11 Let $X$ and $Y$ be independently distributed according to distributions $P_\xi$ and $Q_\eta$,
respectively. Suppose that $\xi$ and $\eta$ are real-valued and independent according to some
prior distributions $\Lambda$ and $\Lambda'$. If, with squared error loss, $\delta_\Lambda$ is the Bayes estimator of $\xi$
on the basis of $X$, and $\delta'_{\Lambda'}$ is that of $\eta$ on the basis of $Y$,

(a) show that $\delta'_{\Lambda'} - \delta_\Lambda$ is the Bayes estimator of $\eta - \xi$ on the basis of $(X, Y)$;

(b) if $\eta > 0$ and $\delta^*_{\Lambda'}$ is the Bayes estimator of $1/\eta$ on the basis of $Y$, show that $\delta_\Lambda \cdot \delta^*_{\Lambda'}$
is the Bayes estimator of $\xi/\eta$ on the basis of $(X, Y)$.

2.12 For the density (2.2.13) and improper prior $(d\sigma/\sigma) \cdot (d\sigma_A/\sigma_A)$, show that the posterior
distribution of $(\sigma, \sigma_A)$ continues to be improper.

2.13 (a) In Example 2.7, obtain the Jeffreys prior distribution of $(\sigma, \tau)$.

(b) Show that for the prior of part (a), the posterior distribution of $(\sigma, \tau)$ is proper.

2.14 Verify the Bayes estimator (2.2.14). 

2.15 Let $X \sim N(\theta, 1)$ and $L(\theta, \delta) = (\theta - \delta)^2$.

(a) Show that $X$ is the limit of the Bayes estimators $\delta^{\pi_n}$, where $\pi_n$ is $N(0, n)$. Hence,
$X$ is both generalized Bayes and a limit of Bayes estimators.

(b) For the prior measure $\pi(\theta) = e^{a\theta}$, $a > 0$, show that the generalized Bayes estimator
is $X + a$.

(c) For $a > 0$, show that there is no sequence of proper priors for which $\delta^{\pi_n} \to X + a$.

This example is due to Farrell; see Kiefer 1966. Heath and Sudderth (1989), building on
the work of Stone (1976), showed that inferences from this model are incoherent, and
established when generalized Bayes estimators will lead to coherent (that is, noncontradictory)
inferences. Their work is connected to the theory of “approximable by proper
priors,” developed by Stein (1965) and Stone (1965, 1970, 1976), which shows when
generalized Bayes estimators can be looked upon as Bayes estimators.

2.16 (a) For the situation of Example 2.8, verify that $\delta(x) = x/n$ is a generalized Bayes
estimator.

(b) If $X \sim N(\theta, 1)$ and $L(\theta, \delta) = (\theta - \delta)^2$, show that $X$ is generalized Bayes under the
improper prior $\pi(\theta) = 1$.


Section 3 

3.1 For the situation of Example 3.1:

(a) Verify that the Bayes estimator will only depend on the data through $Y = \max_i X_i$.

(b) Show that $E(\Theta \mid y, a, b)$ can be expressed as

$E(\Theta \mid y, a, b) = \dfrac{1}{b(n + a - 1)}\, \dfrac{P\left(\chi^2_{2(n+a-1)} < 2/by\right)}{P\left(\chi^2_{2(n+a)} < 2/by\right)},$

where $\chi^2_\nu$ is a chi-squared random variable with $\nu$ degrees of freedom. (In this form,
the estimator is particularly easy to calculate, as many computer packages will have
the chi-squared distribution built in.)

3.2 Let $X_1, \ldots, X_n$ be iid from Gamma$(a, b)$ where $a$ is known.

(a) Verify that the conjugate prior for the natural parameter $\eta = -1/b$ is equivalent to
an inverted gamma prior on $b$.

(b) Using the prior in part (a), find the Bayes estimator under the losses (i) $L(b, \delta) =
(b - \delta)^2$ and (ii) $L(b, \delta) = (1 - \delta/b)^2$.

(c) Express the estimator in part (b)(i) in the form (3.3.9). Can the same be done for
the estimator in part (b)(ii)?

3.3 (a) Prove Corollary 3.3. 

(b) Verify the calculation of the Bayes estimator in Example 3.4. 

3.4 Using Stein's identity (Lemma 1.5.15), show that if $X_i \sim p_{\eta_i}(x)$ of (3.3.7), then

$E_\eta(-\nabla \log h(X)) = \eta,$

$R(\eta, -\nabla \log h(X)) = \sum_{i=1}^p E_\eta\left[-\dfrac{\partial^2}{\partial X_i^2} \log h(X)\right].$


3.5 (a) If $X_i \sim$ Gamma$(a, b)$, $i = 1, \ldots, p$, independent with $a$ known, calculate
$-\nabla \log h(x)$ and its expected value.

(b) Apply the results of part (a) to the situation where $X_i \sim N(0, \sigma_i^2)$, $i = 1, \ldots, p$,
independent. Does it lead to an unbiased estimator of $\sigma_i^2$?

[Note: For part (b), squared error loss on the natural parameter $1/\sigma^2$ leads to the
loss $L(\sigma^2, \delta) = (\sigma^2\delta - 1)^2/\sigma^4$ for estimation of $\sigma^2$.]

(c) If 

tan(fl,-7r) 

Xj ~ -x ‘ ( 1 — x) , 0 < x < 1, i = 1, •.., p, independent, 

it 

evaluate $-\nabla \log h(X)$ and show that it is an unbiased estimator of $a = (a_1, \ldots, a_p)$.
3.6 For the situation of Example 3.6:

(a) Show that if $\delta$ is a Bayes estimator of $\theta$, then $\delta' = \delta/\sigma^2$ is a Bayes estimator of $\eta$,
and hence $R(\theta, \delta) = \sigma^4 R(\eta, \delta')$.

(b) Show that the risk of the Bayes estimator of $\eta$ is given by

$\dfrac{p\tau^4}{\sigma^2(\sigma^2 + \tau^2)^2} + \left(\dfrac{\sigma^2}{\sigma^2 + \tau^2}\right)^2 \sum a_i^2,$

where $a_i = \eta_i - \mu/\sigma^2$.

(c) If $\sum a_i^2 = k$, a fixed constant, then the minimum risk is attained at $\eta_i = \mu/\sigma^2 + \sqrt{k/p}$.

3.7 If $X$ has the distribution $p_\theta(x)$ of (1.5.1) show that, similar to Theorem 3.2, $E(\mathrm{T}\eta(\theta)) =
\nabla \log m_\pi(x) - \nabla \log h(x)$.

3.8 (a) Use Stein's identity (Lemma 1.5.15) to show that if $X_i \sim p_{\eta_i}(x)$ of (3.3.18),
then

$E_\eta\left(-\dfrac{\partial}{\partial X_j} \log h(X)\right) = \sum_i \eta_i\, E_\eta\left(\dfrac{\partial}{\partial X_j} T_i(X)\right).$

(b) If $X_i$ are iid from a gamma distribution Gamma$(a, b)$, where the shape parameter
$a$ is known, use part (a) to find an unbiased estimator of $1/b$.

(c) If the $X_i$ are iid from a beta$(a, b)$ distribution, can the identity in part (a) be used
to obtain an unbiased estimator of $a$ when $b$ is known, or an unbiased estimator of
$b$ when $a$ is known?

3.9 For the natural exponential family $p_\eta(x)$ of (3.3.7) and the conjugate prior $\pi(\eta \mid k, \mu)$
of (3.3.19) establish that:

(a) $E(X) = A'(\eta)$ and $\operatorname{var} X = A''(\eta)$, where the expectation is with respect to the
sampling density $p_\eta(x)$.

(b) $EA'(\eta) = \mu$ and $\operatorname{var}[A'(\eta)] = (1/k)EA''(\eta)$, where the expectation is with respect
to the prior distribution.

[The results in part (b) enable us to think of $\mu$ as a prior mean and $k$ as a prior sample
size.]

3.10 For each of the following situations, write the density in the form (3.7), and identify
the natural parameter. Obtain the Bayes estimator of $A'(\eta)$ using squared error loss and the
conjugate prior. Express your answer in terms of the original parameters. (a) $X \sim$
binomial$(p, n)$, (b) $X \sim$ Poisson$(\lambda)$, and (c) $X \sim$ Gamma$(a, b)$, $a$ known.

3.11 For the situation of Problem 3.9, if $X_1, \ldots, X_n$ are iid as $p_\eta(x)$ and the prior is the
conjugate $\pi(\eta \mid k, \mu)$, then the posterior distribution is $\pi\left(\eta \mid k + n, \frac{k\mu + n\bar x}{k + n}\right)$.

3.12 If $X_1, \ldots, X_n$ are iid from a one-parameter exponential family, the Bayes estimator
of the mean, under squared error loss using a conjugate prior, is of the form $a\bar X + b$ for
constants $a$ and $b$.

(a) If $EX_i = \mu$ and $\operatorname{var} X_i = \sigma^2$, then no matter what the distribution of the $X_i$'s, the
mean squared error is

$E[(a\bar X + b) - \mu]^2 = a^2 \operatorname{var} \bar X + [(a - 1)\mu + b]^2.$

(b) If $\mu$ is unbounded, then no estimator of the form $a\bar X + b$ can have finite mean
squared error for $a \ne 1$.

(c) Can a conjugate-prior Bayes estimator in an exponential family have finite mean
squared error?

[This problem shows why conjugate-prior Bayes estimators are considered “non-robust.”]


Section 4 

4.1 For the situation of Example 4.2: 

(a) Show that the Bayes rule under a beta(a, a) prior is equivariant. 

(b) Show that the Bayes rule under any prior that is symmetric about 1 /2 is equivariant. 

4.2 The Bayes estimator of j; in Example 4.7 is given by (4.22). 

4.3 The Bayes estimator of $\tau$ in Example 4.5 is given by (4.22).

4.4 The Bayes estimators of $\xi$ and $\tau$ in Example 4.9 are given by (4.31) and (4.32).
(Recall Corollary 1.2.)

4.5 For each of the following situations, find a group $G$ that leaves the model invariant
and determine left- and right-invariant measures over $G$. The joint density of $X =
(X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ and the estimand are

(a) $f(x - \eta, y - \xi)$, estimand $\eta - \xi$;

(b) $f\left(\frac{x}{\sigma}, \frac{y}{\tau}\right)$, estimand $\tau/\sigma$;

(c) $f\left(\frac{x - \eta}{\tau}, \frac{y - \xi}{\tau}\right)$, $\tau$ unknown; estimand $\eta - \xi$.

4.6 For each of the situations of Problem 4.5, determine the MRE estimator if the loss is 
squared error with a scaling that makes it invariant. 

4.7 For each of the situations of Problem 4.5: 

(a) Determine the measure over $\Omega$ induced by the right-invariant Haar measure over
$G$;

(b) Determine the Bayes estimator with respect to the measure found in part (a), and 
show that it coincides with the MRE estimator. 

4.8 In Example 4.9, show that the estimator 

f f 4/ *= ]L )dvdu 
' ~ ff&f ^dvdu 

is equivariant under scale changes; that is, it satisfies f(cx) = cf (x) for all values of r 
for which the integrals in f (x) exist. 

4.9 If $\Lambda$ is a left-invariant measure over $G$, show that $\Lambda^*$ defined by $\Lambda^*(B) = \Lambda(B^{-1})$ is
right invariant, where $B^{-1} = \{g^{-1} : g \in B\}$.

[Hint: Express $\Lambda^*(Bg)$ and $\Lambda^*(B)$ in terms of $\Lambda$.]

4.10 There is a correspondence between Haar measures and Jeffreys priors in the location 
and scale cases. 

(a) Show that in the location parameter case, the Jeffreys prior is equal to the invariant 
Haar measure. 

(b) Show that in the scale parameter case, the Jeffreys prior is equal to the invariant 
Haar measure. 

(c) Show that in the location-scale case, the Jeffreys prior is equal to the left invariant 
Haar measure. 

[Part (c) is a source of some concern because, as mentioned in Section 4.4 (see the
discussion following Example 4.9), the best-equivariant rule is Bayes against the right-invariant
Haar measure (if it exists).]

4.11 For the model (3.3.23), find a measure $\nu$ in the $(\xi, \tau)$ plane which remains invariant
under the transformations (3.3.24).

The next three problems contain a more formal development of left- and right-invariant 
Haar measures. 

4.12 A measure $\Lambda$ over a group $G$ is said to be right invariant if it satisfies $\Lambda(Bg) = \Lambda(B)$
and left invariant if it satisfies $\Lambda(gB) = \Lambda(B)$. Note that if $G$ is commutative, the two
definitions agree.

(a) If the elements $g \in G$ are real numbers $(-\infty < g < \infty)$ and group composition is
$g_2 \cdot g_1 = g_1 + g_2$, the measure $\nu$ defined by $\nu(B) = \int_B dx$ (i.e., Lebesgue measure)
is both left and right invariant.

(b) If the elements $g \in G$ are the positive real numbers, and composition of $g_2$ and $g_1$
is multiplication of the two numbers, the measure $\nu$ defined by $\nu(B) = \int_B (1/y)\,dy$
is both left and right invariant.


4.13 If the elements $g \in G$ are pairs of real numbers $(a, b)$, $b > 0$, corresponding to the
transformations $gx = a + bx$, group composition by (1.4.8) is

$(a_2, b_2) \cdot (a_1, b_1) = (a_2 + a_1 b_2,\ b_1 b_2).$

Of the measures defined by

$\nu(B) = \iint_B \dfrac{1}{y}\,dx\,dy$ and $\nu(B) = \iint_B \dfrac{1}{y^2}\,dx\,dy,$

the first is right but not left invariant, and the second is left but not right invariant.

4.14 The four densities defining the measures $\nu$ of Problems 4.12 and 4.13 ($dx$, $(1/y)\,dy$,
$(1/y)\,dx\,dy$, $(1/y^2)\,dx\,dy$) are the only densities (up to multiplicative constants) for which
$\nu$ has the stated invariance properties in the situations of these problems.

[Hint: In each case, consider the equation

$\int_B \pi(\theta)\,d\theta = \int_{gB} \pi(\theta)\,d\theta.$

In the right integral, make the transformation to the new variable or variables $\theta' = g^{-1}\theta$.
If $J$ is the Jacobian of this transformation, it follows that

$\int_B [\pi(\theta) - J\pi(g\theta)]\,d\theta = 0$ for all $B$

and, hence, that $\pi(\theta) = J\pi(g\theta)$ for all $\theta$ except in a null set $N_g$. The proof of Theorem
4 in Chapter 6 of TSH2 shows that $N_g$ can be chosen independent of $g$. This proves in
Problem 4.12(a) that for all $\theta \notin N$, $\pi(\theta) = \pi(\theta + c)$, and hence that $\pi(c) =$ constant a.e.
The other three cases can be treated analogously.]


Section 5 

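5.1 For the model (3.3.1), let $\pi(\theta \mid x, \lambda)$ be a single-prior Bayes posterior and $\pi(\theta \mid x)$
be a hierarchical Bayes posterior. Show that $\pi(\theta \mid x) = \int \pi(\theta \mid x, \lambda) \cdot \pi(\lambda \mid x)\,d\lambda$, where
$\pi(\lambda \mid x) = \int f(x \mid \theta)\pi(\theta \mid \lambda)\gamma(\lambda)\,d\theta \big/ \iint f(x \mid \theta)\pi(\theta \mid \lambda)\gamma(\lambda)\,d\theta\,d\lambda$.

5.2 For the situation of Problem 5.1, show that:

(a) $E(\theta \mid x) = E[E(\theta \mid x, \lambda)]$;

(b) $\operatorname{var}(\theta \mid x) = E[\operatorname{var}(\theta \mid x, \lambda)] + \operatorname{var}[E(\theta \mid x, \lambda)]$;

and hence that $\pi(\theta \mid x)$ will tend to have a larger variance than $\pi(\theta \mid x, \lambda_0)$.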

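5.3 For the model (3.3.3), show that:

(a) The marginal prior of $\theta$, unconditional on $\tau^2$, is given by

$\pi(\theta) = \dfrac{\Gamma(a + \frac{1}{2})}{\sqrt{2\pi}\,\Gamma(a)\,b^a}\, \dfrac{1}{\left(\frac{1}{b} + \frac{\theta^2}{2}\right)^{a + 1/2}},$

which for $a = \nu/2$ and $b = 2/\nu$ is Student's $t$-distribution with $\nu$ degrees of freedom.

(b) The marginal posterior of $\tau^2$ is given by

$\pi(\tau^2 \mid x) = \dfrac{\left[\frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right]^{1/2} e^{-\frac{1}{2}\frac{x^2}{\sigma^2 + \tau^2}}\, \frac{1}{(\tau^2)^{a + 3/2}}\, e^{-1/b\tau^2}}{\int_0^\infty \left[\frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right]^{1/2} e^{-\frac{1}{2}\frac{x^2}{\sigma^2 + \tau^2}}\, \frac{1}{(\tau^2)^{a + 3/2}}\, e^{-1/b\tau^2}\,d\tau^2}.$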


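5.4 Albert and Gupta (1985) investigate theory and applications of the hierarchical model

$X_i \mid \theta_i \sim b(\theta_i, n)$, $i = 1, \ldots, p$, independent,

$\theta_i \mid \eta \sim$ beta$[k\eta, k(1 - \eta)]$, $k$ known,

$\eta \sim$ Uniform$(0, 1)$.

(a) Show that

$E(\theta_i \mid x) = \left(\dfrac{n}{n + k}\right)\left(\dfrac{x_i}{n}\right) + \left(\dfrac{k}{n + k}\right) E(\eta \mid x),$

$\operatorname{var}(\theta_i \mid x) = \dfrac{E(\theta_i \mid x)[1 - E(\theta_i \mid x)]}{n + k + 1} + \dfrac{k^2}{(n + k)(n + k + 1)}\operatorname{var}(\eta \mid x).$

[Note that $E(\eta \mid x)$ and $\operatorname{var}(\eta \mid x)$ are not expressible in a simple form.]

(b) Unconditionally on $\eta$, the $\theta_i$'s have conditional covariance

$\operatorname{cov}(\theta_i, \theta_j \mid x) = \left(\dfrac{k}{n + k}\right)^2 \operatorname{var}(\eta \mid x)$, $i \ne j$.

(c) Ignoring the prior distribution of $\eta$, show how to construct an empirical Bayes
estimator of $\theta_i$. (Again, this is not expressible in a simple form.)

[Albert and Gupta (1985) actually consider a more general model than given here, and
show how to approximate the Bayes solution. They apply their model to a problem of
nonresponse in mail surveys.]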

5.5 (a) Analogous to Problem 1.7.9, establish that for any random variables $X$, $Y$, and
$Z$,

$\operatorname{cov}(X, Y) = E[\operatorname{cov}(X, Y \mid Z)] + \operatorname{cov}[E(X \mid Z), E(Y \mid Z)].$

(b) For the hierarchy

$X_i \mid \theta_i \sim f(x \mid \theta_i)$, $i = 1, \ldots, p$, independent,

$\Theta_i \mid \lambda \sim \pi(\theta_i \mid \lambda)$, $i = 1, \ldots, p$, independent,

$\Lambda \sim \gamma(\lambda)$,

show that $\operatorname{cov}(\Theta_i, \Theta_j \mid x) = \operatorname{cov}[E(\Theta_i \mid x, \lambda), E(\Theta_j \mid x, \lambda)]$.

(c) If $E(\Theta_i \mid x, \lambda) = g(x_i) + h(\lambda)$, $i = 1, \ldots, p$, where $g(\cdot)$ and $h(\cdot)$ are known, then
$\operatorname{cov}(\Theta_i, \Theta_j \mid x) = \operatorname{var}[E(\Theta_i \mid x, \lambda)]$.

[Part (c) points to what can be considered a limitation in the applicability of some
hierarchical models, that they imply a positive correlation structure in the posterior
distribution.]

5.6 The one-way random effects model of Example 2.7 (see also Examples 3.5.1 and
3.5.5) can be written as the hierarchical model

$X_{ij} \mid \mu, \alpha_i \sim N(\mu + \alpha_i, \sigma^2)$, $j = 1, \ldots, n$, $i = 1, \ldots, s$,

$\alpha_i \sim N(0, \sigma_A^2)$, $i = 1, \ldots, s$.

If, in addition, we specify that $\mu \sim$ Uniform$(-\infty, \infty)$, show that the Bayes estimator
of $\mu + \alpha_i$ under squared error loss is given by (3.5.13), the UMVU predictor of $\mu + \alpha_i$.

5.7 Referring to Example 6.6:

(a) Using the prior distribution for $\gamma(b)$ given in (5.6.27), show that the mode of the
posterior distribution $\pi(b \mid x)$ is $\hat b = (p\bar x + \alpha - 1)/(pa + \beta - 1)$, and hence the
empirical Bayes estimator based on this $\hat b$ does not equal the hierarchical Bayes
estimator (5.6.29).

(b) Show that if we estimate $b/(b + 1)$ using its posterior expectation $E[b/(b + 1) \mid x]$, then
the resulting empirical Bayes estimator is equal to the hierarchical Bayes estimator.

5.8 The method of Monte Carlo integration allows the calculation of (possibly complicated)
integrals by using (possibly simple) generations of random variables.

(a) To calculate $\int h(x) f_X(x)\,dx$, generate a sample $X_1, \ldots, X_m$, iid, from $f_X(x)$. Then,
$\frac{1}{m}\sum_{i=1}^m h(X_i) \to \int h(x) f_X(x)\,dx$ as $m \to \infty$.

(b) If it is difficult to generate random variables from $f_X(x)$, then generate pairs of
random variables

$Y_i \sim f_Y(y)$,

$X_i \sim f_{X|Y}(x \mid y_i)$.

Then, $\frac{1}{m}\sum_{i=1}^m h(X_i) \to \int h(x) f_X(x)\,dx$ as $m \to \infty$.

[Show that if $X$ is generated according to $Y \sim f_Y(y)$ and $X \sim f_{X|Y}(x \mid Y)$, then
$P(X \le a) = \int_{-\infty}^a f_X(x)\,dx$.]

(c) If it is difficult to generate as in part (b), then generate

$X_{m_i} \sim f_{X|Y}(x \mid Y_{m_{i-1}})$,

$Y_{m_i} \sim f_{Y|X}(y \mid x_{m_i})$,

for $i = 1, \ldots, K$ and $m = 1, \ldots, M$.

Show that:

(i) for each $m$, $\{X_{m_i}\}$ is a Markov chain. If it is also an ergodic Markov chain, $X_{m_i} \to X$
in distribution as $i \to \infty$, where $X$ has the stationary distribution of the chain.

(ii) If the stationary distribution of the chain is $f_X(x)$, then

$\frac{1}{M}\sum_{m=1}^M h(X_{m_k}) \to \int h(x) f_X(x)\,dx$ as $M, k \to \infty$.

[This is the basic theory behind the Gibbs sampler. For each $k$, we have generated independent
random variables $X_{m_k}$, $m = 1, \ldots, M$, where $X_{m_k}$ is distributed according
to $f_{X|Y}(x \mid y_{m_{k-1}})$. It is also the case that for each $m$ and large $k$, $X_{m_k}$ is approximately
distributed according to $f_X(x)$, although the variables are not now independent. The advantages
and disadvantages of these computational schemes (one-long-chain vs. many-short-chains)
are debated in Gelman and Rubin 1992; see also Geyer and Thompson
1992 and Smith and Roberts 1992. The prevailing consensus leans toward one long
chain.]
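
The alternating scheme of part (c) can be sketched for a case where both conditionals are easy to sample. The Python example below is an illustration under assumptions not in the problem: it takes $(X, Y)$ to be standard bivariate normal with correlation $\rho$, so that $X \mid Y = y \sim N(\rho y, 1 - \rho^2)$ and $Y \mid X = x \sim N(\rho x, 1 - \rho^2)$, runs one long chain, and averages $h(X) = X^2$. The function name, burn-in length, and seed are hypothetical choices.

```python
import random


def gibbs_bivariate_normal(rho, n_iter, burn_in=500, seed=0):
    """One long Gibbs chain for a standard bivariate normal with correlation rho.

    Returns the X draws kept after the burn-in period.
    """
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5  # conditional standard deviation
    x, y = 0.0, 0.0                     # arbitrary starting point
    draws = []
    for i in range(burn_in + n_iter):
        x = rng.gauss(rho * y, cond_sd)  # X | Y = y
        y = rng.gauss(rho * x, cond_sd)  # Y | X = x
        if i >= burn_in:
            draws.append(x)
    return draws


if __name__ == "__main__":
    xs = gibbs_bivariate_normal(rho=0.5, n_iter=20000)
    # Monte Carlo estimate of E[h(X)] with h(x) = x^2; the marginal of X is N(0, 1).
    print(sum(v * v for v in xs) / len(xs))
```

The draws within a chain are dependent, so the average converges more slowly than it would for an iid sample of the same length.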

5.9 To understand the convergence of the Gibbs sampler, let $(X, Y) \sim f(x, y)$, and define
$k(x, x') = \int f_{X|Y}(x \mid y) f_{Y|X}(y \mid x')\,dy.$

(a) Show that the function $h^*(\cdot)$ that solves $h^*(x) = \int k(x, x')h^*(x')\,dx'$ is $h^*(x) =
f_X(x)$, the marginal distribution of $X$.

(b) Write down the analogous integral equation that is solved by $f_Y(y)$.

(c) Define a sequence of functions recursively by $h_{i+1}(x) = \int k(x, x')h_i(x')\,dx'$, where
$h_0(x)$ is arbitrary but satisfies $\sup_x |h_0(x)/h^*(x)| < \infty$. Show that

$\int |h_{i+1}(x) - h^*(x)|\,dx < \int |h_i(x) - h^*(x)|\,dx$

and, hence, $h_i(x)$ converges to $h^*(x)$.

[The method of part (c) is called successive substitution. When there are two variables in
the Gibbs sampler, it is equivalent to data augmentation (Tanner and Wong 1987). Even
if the variables are vector-valued, the above results establish convergence. If the original
vector of variables contains more than two variables, then a more general version of this
argument is needed (Gelfand and Smith 1990).]

5.10 A direct Monte Carlo implementation of substitution sampling is provided by the
data augmentation algorithm (Tanner and Wong 1987). If we define

$h_{i+1}(x) = \int \left[\int f_{X|Y}(x \mid y) f_{Y|X}(y \mid x')\,dy\right] h_i(x')\,dx',$

then from Problem 5.9, $h_i(x) \to f_X(x)$ as $i \to \infty$.

(a) To calculate $h_{i+1}$ using Monte Carlo integration:

(i) Generate $X'_j \sim h_i(x')$, $j = 1, \ldots, J$.

(ii) Generate, for each $x'_j$, $Y_{jk} \sim f_{Y|X}(y \mid x'_j)$, $k = 1, \ldots, K$.

(iii) Calculate $\hat h_{i+1}(x) = \frac{1}{J}\sum_{j=1}^J \frac{1}{K}\sum_{k=1}^K f_{X|Y}(x \mid y_{jk})$.

Then, $\hat h_{i+1}(x) \to h_{i+1}(x)$ as $J, K \to \infty$, and hence the data augmentation algorithm
converges.

(b) To implement (a)(i), we must be able to generate a random variable from a mixture
distribution. Show that if $f_Y(y) = \sum_{i=1}^n a_i g_i(y)$, $\sum a_i = 1$, then the algorithm

(i) Select $g_i$ with probability $a_i$

(ii) Generate $Y \sim g_i$

produces a random variable with distribution $f_Y$. Hence, show how to implement
step (a)(i) by generating random variables from $f_{X|Y}$. Tanner and Wong (1987)
note that this algorithm will work even if $J = 1$, which yields the approximation
$\hat h_{i+1}(x) = f_{X|Y}(x \mid y_i)$, identical to the Gibbs sampler. The data augmentation
algorithm can also be seen as an application of the process of multiple imputation
(Rubin 1976, 1987, Little and Rubin 1987).


5.11 Successive substitution sampling can be implemented via the Gibbs sampler in the
following way. From Problem 5.8(c), we want to calculate

$h_M = \frac{1}{M}\sum_{m=1}^M k(x \mid x_{m_k}) = \frac{1}{M}\sum_{m=1}^M \int f_{X|Y}(x \mid y) f_{Y|X}(y \mid x_{m_k})\,dy.$

(a) Show that $h_M(x) \to f_X(x)$ as $M \to \infty$.

(b) Given $x_{m_k}$, a Monte Carlo approximation to $h_M(x)$ is

$\hat h_M(x) = \frac{1}{M}\sum_{m=1}^M \frac{1}{J}\sum_{j=1}^J f_{X|Y}(x \mid y_{k_j}),$

where $Y_{k_j} \sim f_{Y|X}(y \mid x_{m_k})$ and $\hat h_M(x) \to h_M(x)$ as $J \to \infty$.

(c) Hence, as $M, J \to \infty$, $\hat h_M(x) \to f_X(x)$.

[This is the Gibbs sampler, which is usually implemented with $J = 1$.]

5.12 For the situation of Example 5.6, show that 


(a) 




£]£(0|x,r,) 


(b) 


M 


1 M \ / l M 


(c) Discuss when equality might hold in (b). Can you give an example? 


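5.13 Show that for the hierarchy (5.5.1), the posterior distributions $\pi(\theta \mid x)$ and $\pi(\lambda \mid x)$
satisfy

$\pi(\theta \mid x) = \int \left[\int \pi(\theta \mid x, \lambda)\pi(\lambda \mid x, \theta')\,d\lambda\right] \pi(\theta' \mid x)\,d\theta',$

$\pi(\lambda \mid x) = \int \left[\int \pi(\lambda \mid x, \theta)\pi(\theta \mid x, \lambda')\,d\theta\right] \pi(\lambda' \mid x)\,d\lambda',$

and, hence, are stationary points of the Markov chains in (5.5.13).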

5.14 Starting from a uniform random variable $U \sim$ Uniform$(0, 1)$, it is possible to
construct many random variables through transformations.

(a) Show that $-\log U \sim \exp(1)$.

(b) Show that $-\sum_{i=1}^n \log U_i \sim$ Gamma$(n, 1)$, where $U_1, \ldots, U_n$ are iid as $U(0, 1)$.

(c) Let $X \sim$ Exp$(a, b)$. Write $X$ as a function of $U$.

(d) Let $X \sim$ Gamma$(n, \beta)$, $n$ an integer. Write $X$ as a function of $U_1, \ldots, U_n$, iid as
$U(0, 1)$.
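
The transformations in parts (a), (b), and (d) are easy to try numerically. The Python sketch below is an illustration only: the function names, the scale parameterization of the gamma distribution, and the seed are assumptions, and $1 - U$ is used in place of $U$ (it is also uniform) so that the logarithm is never evaluated at zero.

```python
import math
import random


def exponential_from_uniform(rng):
    """Exp(1) draw via -log of a uniform variable."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
    return -math.log(u)


def gamma_from_uniforms(n, scale, rng):
    """Gamma draw with integer shape n: scale times a sum of n Exp(1) draws."""
    return scale * sum(exponential_from_uniform(rng) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [gamma_from_uniforms(3, 2.0, rng) for _ in range(20000)]
    # The sample mean should be close to shape * scale = 6.
    print(sum(draws) / len(draws))
```

This construction only reaches integer shape parameters, which is the limitation taken up in the next problem.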


5.15 Starting with a $U(0, 1)$ random variable, the transformations of Problem 5.14 will
not get us normal random variables, or gamma random variables with noninteger shape
parameters. One way of doing this is to use the Accept-Reject Algorithm (Ripley 1987,
Section 3.2), an algorithm for simulating $X \sim f(x)$:

(i) Generate $Y \sim g(y)$, $U \sim U(0, 1)$, independent.

(ii) Calculate $\rho(Y) = \frac{1}{M}\frac{f(Y)}{g(Y)}$, where $M = \sup_t f(t)/g(t)$.

(iii) If $U < \rho(Y)$, set $X = Y$; otherwise return to (i).

(a) Show that the algorithm will generate X ~ f(x). 

(b) Starting with Y ~ exp(l), show how to generate X ~ N(0, 1). 

(c) Show how to generate a gamma random variable with a noninteger shape parameter. 
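
One possible realization of steps (i) to (iii) is sketched below in Python. It is an illustration under stated assumptions, not the intended solution: the target $f$ is taken to be the half-normal density, the candidate $g$ is Exp(1), and a random sign is attached at the end to obtain $N(0, 1)$. With these choices $M = \sqrt{2e/\pi}$ and $\rho(y) = e^{-(y-1)^2/2}$. The function name is hypothetical.

```python
import math
import random


def sample_standard_normal(rng):
    """Accept-reject draw from N(0, 1) using Exp(1) candidates.

    Target f(y) = sqrt(2/pi) * exp(-y^2 / 2) on y > 0 (half-normal),
    candidate g(y) = exp(-y); then f(y) / (M g(y)) = exp(-(y - 1)^2 / 2).
    """
    while True:
        y = -math.log(1.0 - rng.random())        # (i) Y ~ Exp(1)
        u = rng.random()                         # (i) U ~ U(0, 1)
        rho = math.exp(-0.5 * (y - 1.0) ** 2)    # (ii) acceptance probability
        if u < rho:                              # (iii) accept, else repeat
            return y if rng.random() < 0.5 else -y


if __name__ == "__main__":
    rng = random.Random(42)
    xs = [sample_standard_normal(rng) for _ in range(20000)]
    mean = sum(xs) / len(xs)
    print(mean, sum((v - mean) ** 2 for v in xs) / len(xs))
```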

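5.16 Consider the normal hierarchical model

$$X|\theta_1 \sim N(\theta_1, \sigma_1^2),$$
$$\theta_1|\theta_2 \sim N(\theta_2, \sigma_2^2),$$
$$\vdots$$
$$\theta_{k-1}|\theta_k \sim N(\theta_k, \sigma_k^2),$$

where $\sigma_i^2$, $i = 1, \ldots, k$, are known.

(a) Show that the posterior distribution of $\theta_i$ $(1 \le i \le k - 1)$ is

$$\pi(\theta_i|x, \theta_k) = N\left(\alpha_i x + (1 - \alpha_i)\theta_k,\ \tau_i^2\right),$$

where $\tau_i^2 = \left(\sum_{j=1}^{i}\sigma_j^2\right)\left(\sum_{j=i+1}^{k}\sigma_j^2\right)\big/\sum_{j=1}^{k}\sigma_j^2$ and $\alpha_i = \tau_i^2\big/\sum_{j=1}^{i}\sigma_j^2$.

(b) Find an expression for the Kullback-Leibler information $K[\pi(\theta_i|x, \theta_k), \pi(\theta_i|\theta_k)]$
and show that it is a decreasing function of $i$.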


5.17 The original proof of Theorem 5.7 (Goel and DeGroot 1981) used Rényi's entropy
function (Rényi 1961)

$$R_\alpha(f, g) = \frac{1}{\alpha - 1}\log\int f^{\alpha}(x)\, g^{1-\alpha}(x)\, d\mu(x),$$

where $f$ and $g$ are densities, $\mu$ is a dominating measure, and $\alpha$ is a constant, $\alpha \ne 1$.

(a) Show that $R_\alpha(f, g)$ satisfies $R_\alpha(f, g) > 0$ and $R_\alpha(f, f) = 0$.

(b) Show that Theorem 5.7 holds if $R_\alpha(f, g)$ is used instead of $K[f, g]$.

(c) Show that $\lim_{\alpha \to 1} R_\alpha(f, g) = K[f, g]$, and provide another proof of Theorem 5.7.


5.18 The Kullback-Leibler information, $K[f, g]$ (5.5.25), is not symmetric in $f$ and $g$,
and a modification, called the divergence, remedies this. Define $J[f, g]$, the divergence
between $f$ and $g$, to be $J[f, g] = K[f, g] + K[g, f]$. Show that, analogous to Theorem
5.7, $J[\pi(\lambda|x), \gamma(\lambda)] < J[\pi(\theta|x), \pi(\theta)]$.

5.19 Goel and DeGroot (1981) define a Bayesian analog of Fisher information [see
(2.5.10)] as

$$\mathcal{I}[\pi(\theta|x)] = \int_{\Omega}\frac{\left[\frac{\partial}{\partial x}\pi(\theta|x)\right]^2}{\pi(\theta|x)}\, d\theta,$$

the information that $x$ has about the posterior distribution. As in Theorem 5.7, show that
$\mathcal{I}[\pi(\lambda|x)] < \mathcal{I}[\pi(\theta|x)]$, again showing that the influence of $\lambda$ is less than that of $\theta$.

5.20 Each of $m$ spores has a probability $\tau$ of germinating. Of the $r$ spores that germinate,
each has probability $\omega$ of bending in a particular direction. If $s$ bend in the particular
direction, a probability model to describe this process is the bivariate binomial, with
mass function

$$f(r, s|\tau, \omega, m) = \binom{m}{r}\tau^{r}(1 - \tau)^{m-r}\binom{r}{s}\omega^{s}(1 - \omega)^{r-s}.$$

(a) Show that the Jeffreys prior is $\pi_J(\tau, \omega) = (1 - \tau)^{-1/2}\,\omega^{-1/2}(1 - \omega)^{-1/2}$.

(b) If $\tau$ is considered a nuisance parameter, the reference prior is

$$\pi_R(\tau, \omega) = \tau^{-1/2}(1 - \tau)^{-1/2}\,\omega^{-1/2}(1 - \omega)^{-1/2}.$$

Compare the posterior means $E(\omega|r, s, m)$ under both the Jeffreys and reference
priors. Is one more appropriate?

(c) What is the effect of the different priors on the posterior variance?

[Priors for the bivariate binomial have been considered by Crowder and Sweeting (1989),
Polson and Wasserman (1990), and Clark and Wasserman (1993), who propose a
reference/Jeffreys trade-off prior.]

5.21 Let $\mathcal{F} = \{f(x|\theta);\ \theta \in \Omega\}$ be a family of probability densities. The Kullback-Leibler
information for discrimination between two densities in $\mathcal{F}$ can be written

$$\psi(\theta_1, \theta_2) = \int f(x|\theta_1)\log\left[\frac{f(x|\theta_1)}{f(x|\theta_2)}\right] dx.$$

Recall that the gradient of $\psi$ is $\nabla\psi = \{(\partial/\partial\theta_i)\psi\}$ and the Hessian is
$\nabla\nabla\psi = \{(\partial^2/\partial\theta_i\,\partial\theta_j)\psi\}$.

(a) If integration and differentiation can be interchanged, show that

$$\nabla\psi(\theta, \theta) = 0 \quad\text{and}\quad \det[\nabla\nabla\psi(\theta, \theta)] = I(\theta),$$

where $I(\theta)$ is the Fisher information of $f(x|\theta)$.

(b) George and McCulloch (1993) argue that choosing $\pi(\theta) = (\det[\nabla\nabla\psi(\theta, \theta)])^{1/2}$ is
an appealing least informative choice of priors. What justification can you give for
this?


Section 6 

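6.1 For the model (3.3.1), show that $\delta^{\lambda}(x)|_{\lambda=\hat\lambda} = \delta^{\hat\lambda}(x)$, where the Bayes estimator $\delta^{\lambda}(x)$
minimizes $\int L[\theta, d(x)]\,\pi(\theta|x, \lambda)\, d\theta$ and the empirical Bayes estimator $\delta^{\hat\lambda}(x)$ minimizes
$\int L[\theta, d(x)]\,\pi(\theta|x, \hat\lambda(x))\, d\theta$.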

6.2 This problem will investigate conditions under which an empirical Bayes estimator
is a Bayes estimator. Expression (6.6.3) is a true posterior expected loss if $\pi(\theta|x, \hat\lambda(x))$
is a true posterior.

From the hierarchy

$$X|\theta \sim f(x|\theta),$$
$$\Theta|\lambda \sim \pi(\theta|\lambda),$$

define the joint distribution of $X$ and $\Theta$ to be $(X, \theta) \sim g(x, \theta) = f(x|\theta)\,\pi(\theta|\hat\lambda(x))$, where
$\pi(\theta|\hat\lambda(x))$ is obtained by substituting $\hat\lambda(x)$ for $\lambda$ in $\pi(\theta|\lambda)$.

(a) Show that, for this joint density, the formal Bayes estimator is equivalent to the
empirical Bayes estimator from the hierarchical model.

(b) If $f(\cdot|\theta)$ and $\pi(\cdot|\lambda)$ are proper densities, then $\int g(x, \theta)\, d\theta < \infty$. However,
$\int\!\!\int g(x, \theta)\, dx\, d\theta$ need not be finite.

6.3 For the model (6.3.1), the Bayes estimator $\delta^{\lambda}(x)$ minimizes $\int L(\theta, d(x))\,\pi(\theta|x, \lambda)\, d\theta$
and the empirical Bayes estimator, $\delta^{\hat\lambda}(x)$, minimizes $\int L(\theta, d(x))\,\pi(\theta|x, \hat\lambda(x))\, d\theta$. Show
that $\delta^{\lambda}(x)|_{\lambda=\hat\lambda(x)} = \delta^{\hat\lambda}(x)$.

6.4 For the situation of Example 6.1:

(a) Show that

$$\int_{-\infty}^{\infty} e^{-\frac{n}{2\sigma^2}(\bar x - \theta)^2}\, e^{(-1/2)\theta^2/\tau^2}\, d\theta = \left(\frac{2\pi\sigma^2\tau^2}{\sigma^2 + n\tau^2}\right)^{1/2} e^{(-n/2)\bar x^2/(\sigma^2 + n\tau^2)}$$

and, hence, establish (6.6.4).

(b) Verify that the marginal MLE of $\sigma^2 + n\tau^2$ is $n\bar x^2$ and that the empirical Bayes
estimator is given by (6.6.5).


6.5 Referring to Example 6.2:

(a) Show that the Bayes risk, $r(\pi, \delta^{\pi})$, of the Bayes estimator (6.6.7) is given by

$$r(\pi, \delta^{\pi}) = kE[\mathrm{var}(p_k|x_k)] = \frac{kab}{(a + b)(a + b + 1)(a + b + n)}.$$

(b) Show that the Bayes risk of the unbiased estimator $X/n = (X_1/n, \ldots, X_k/n)$ is
given by

$$r(\pi, X/n) = \frac{kab}{n(a + b + 1)(a + b)}.$$


6.6 Extend Theorem 6.3 to the case of Theorem 3.2; that is, if $X$ has density (3.3.7) and
$\eta$ has prior density $\pi(\eta|\gamma)$, then the empirical Bayes estimator is

$$E\left(\sum_{i}\eta_i\frac{\partial T_i(x)}{\partial x_j}\,\Big|\, x, \hat\gamma(x)\right) = \frac{\partial}{\partial x_j}\log m(x|\hat\gamma(x)) - \frac{\partial}{\partial x_j}\log h(x),$$

where $m(x|\gamma)$ is the marginal distribution of $X$ and $\hat\gamma(x)$ is the marginal MLE of $\gamma$.

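6.7 (a) For $p_\eta(x)$ of (1.5.2), show that for any prior distribution $\pi(\eta|\lambda)$ that is dependent
on a hyperparameter $\lambda$, the empirical Bayes estimator is given by

$$E\left[\sum_{i}\eta_i\frac{\partial}{\partial x_j}T_i(x)\,\Big|\, x, \hat\lambda\right] = \frac{\partial}{\partial x_j}\log m_\pi(x|\hat\lambda(x)) - \frac{\partial}{\partial x_j}\log h(x),$$

where $m_\pi(x) = \int p_\theta(x)\,\pi(\theta)\, d\theta$.

(b) If $X$ has the distribution $p_\theta(x)$ of (1.5.1), show that a similar formula holds, that
is,

$$E(T\eta(\theta)|\hat\lambda) = \nabla\log m_\pi(x|\hat\lambda) - \nabla\log h(x),$$

where $T = \{\partial T_i/\partial x_j\}$ is the Jacobian of $T$ and $\nabla a$ is the gradient vector of $a$, that is,
$\nabla a = \{\partial a/\partial x_i\}$.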

6.8 For each of the following situations, write the empirical Bayes estimator of the natural
parameter (under squared error loss) in the form (6.6.12), using the marginal likelihood
estimator of the hyperparameter $\lambda$. Evaluate the expressions as far as possible.

(a) $X_i \sim N(0, \sigma_i^2)$, $i = 1, \ldots, p$, independent; $1/\sigma_i^2 \sim \mathrm{Exponential}(\lambda)$.

(b) $X_i \sim N(\theta_i, 1)$, $i = 1, \ldots, p$, independent; $\theta_i \sim DE(0, \lambda)$.


6.9 Strawderman (1992) shows that the James-Stein estimator can be viewed as an empirical
Bayes estimator in an arbitrary location family. Let $X_{p \times 1} \sim f(x - \theta)$, with $EX = \theta$
and $\mathrm{var}\, X = \sigma^2 I$. Let the prior be $\theta \sim f^{*n}$, the $n$-fold convolution of $f$ with itself. [The
convolution of $f$ with itself is $f^{*2}(x) = \int f(x - y) f(y)\, dy$. The $n$-fold convolution is
$f^{*n}(x) = \int f^{*(n-1)}(x - y) f(y)\, dy$.] Equivalently, let $U_i \sim f$, $i = 0, \ldots, n$, iid,
$\theta = \sum_{i=1}^{n} U_i$, and $X = U_0 + \theta$.

(a) Show that the Bayes rule against squared error loss is $\frac{n}{n+1}x$. Note that $n$ is a prior
parameter.

(b) Show that $|X|^2/(p\sigma^2)$ is an unbiased estimator of $n + 1$, and hence that an empirical
Bayes estimator of $\theta$ is given by $\delta^{EB} = [1 - (p\sigma^2/|x|^2)]x$.


6.10 Show for the hierarchy of Example 3.4, where $\sigma^2$ and $\tau^2$ are known but $\mu$ is unknown,
that:

(a) The empirical Bayes estimator of $\theta_i$, based on the marginal MLE of $\mu$, is
$\frac{\tau^2}{\sigma^2 + \tau^2}X_i + \frac{\sigma^2}{\sigma^2 + \tau^2}\bar X$.

(b) The Bayes risk, under sum-of-squared-errors loss, of the empirical Bayes estimator 
from part (a) is 


pa 


2 


2 (p - 1 ) 2 ct 4 
p(a 2 + t 2 ) 


+ {p~ 1) 



(c) The minimum risk of the empirical Bayes estimator is attained when all $\theta_i$'s are
equal.

[Hint: Show that $\sum_{i=1}^{p} E[(X_i - \bar X)^2] = \sum_{i=1}^{p}(\theta_i - \bar\theta)^2 + (p - 1)\sigma^2$.]

6.11 For $E(\Theta|x)$ of (5.5.8), show that as $\nu \to \infty$, $E(\Theta|x) \to [p/(p + \sigma^2)]\bar x$, the Bayes
estimator under a $N(0, 1)$ prior.

6.12 (a) Show that the empirical Bayes estimator $\delta^{EB}(\bar x) = (1 - \sigma^2/\max\{\sigma^2, p\bar x^2\})\bar x$ of (6.6.5)
has bounded mean squared error.

(b) Show that a variation of $\delta^{EB}(\bar x)$, of part (a), $\delta^{v}(\bar x) = [1 - \sigma^2/(v + p\bar x^2)]\bar x$, also has
bounded mean squared error.

(c) For $\sigma^2 = \tau^2 = 1$, plot the risk functions of the estimators of parts (a) and (b).

[Thompson (1968a, 1968b) investigated the mean squared error properties of estimators
like those in part (b). Although such estimators have smaller mean squared error than $\bar x$
for small values of $\theta$, they always have larger mean squared error for larger values of $\theta$.]

6.13 (a) For the hierarchy (5.5.7), with $\sigma^2 = 1$ and $p = 10$, evaluate the Bayes risk
$r(\pi, \delta^{\pi})$ of the Bayes estimator (5.5.8) for $\nu = 2$, 5, and 10.

(b) Calculate the Bayes risk of the estimator $\delta^{v}$ of Problem 6.12(b). Find a value of
$v$ that yields a good approximation to the risk of the hierarchical Bayes estimator.
Compare it to the Bayes risk of the empirical Bayes estimator of Problem 6.12(a).

6.14 Referring to Example 6.6, show that the empirical Bayes estimator is also a hierarchical
Bayes estimator using the prior $\gamma(b) = 1/b$.

6.15 The Taylor series approximation to the estimator (5.5.8) is carried out in a number
of steps. Show that:

(a) Using a first-order Taylor expansion around the point $\bar x$, we have

$$\frac{1}{(1 + \theta^2/\nu)^{(\nu+1)/2}} = \frac{1}{(1 + \bar x^2/\nu)^{(\nu+1)/2}} - \frac{\nu + 1}{\nu}\,\frac{\bar x}{(1 + \bar x^2/\nu)^{(\nu+3)/2}}\,(\theta - \bar x) + R(\theta - \bar x),$$

where the remainder, $R(\theta - \bar x)$, satisfies $R(\theta - \bar x)/(\theta - \bar x)^2 \to 0$ as $\theta \to \bar x$.

(b) The remainder in part (a) also satisfies

$$\int_{-\infty}^{\infty} R(\theta - \bar x)\, e^{-\frac{p}{2\sigma^2}(\theta - \bar x)^2}\, d\theta = O(1/p^{3/2}).$$

(c) The numerator and denominator of (5.5.8) can be written


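$$\int_{-\infty}^{\infty}\frac{1}{(1 + \theta^2/\nu)^{(\nu+1)/2}}\, e^{-\frac{p}{2\sigma^2}(\theta - \bar x)^2}\, d\theta = \frac{\sqrt{2\pi\sigma^2/p}}{(1 + \bar x^2/\nu)^{(\nu+1)/2}} + O\left(\frac{1}{p^{3/2}}\right)$$

and

$$\int_{-\infty}^{\infty}\frac{\theta}{(1 + \theta^2/\nu)^{(\nu+1)/2}}\, e^{-\frac{p}{2\sigma^2}(\theta - \bar x)^2}\, d\theta = \frac{\sqrt{2\pi\sigma^2/p}}{(1 + \bar x^2/\nu)^{(\nu+1)/2}}\left[1 - \frac{(\nu + 1)/\nu}{(1 + \bar x^2/\nu)}\,\frac{\sigma^2}{p}\right]\bar x + O\left(\frac{1}{p^{3/2}}\right),$$

which yields (5.6.32).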


6.16 For the situation of Example 6.7: 


(a) Calculate the values of the approximation (5.6.32) for the values of Table 6.2. Are
there situations where the estimator (5.6.32) is clearly preferred over the empirical
Bayes estimator (5.6.5) as an approximation to the hierarchical Bayes estimator
(5.5.8)?

(b) Extend the argument of Problem 6.15 to calculate the next term in the expansion and,
hence, obtain a more accurate approximation to the hierarchical Bayes estimator
(5.5.8). For the values of Table 6.2, is this new approximation to (5.5.8) preferable
to (5.6.5) and (5.6.32)?


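6.17 (a) Show that if $b(\cdot)$ has a bounded second derivative, then

$$\int b(\lambda)\, e^{-nh(\lambda)}\, d\lambda = b(\hat\lambda)\sqrt{\frac{2\pi}{nh''(\hat\lambda)}}\; e^{-nh(\hat\lambda)} + O\left(\frac{1}{n^{3/2}}\right),$$

where $h(\hat\lambda)$ is the unique minimum of $h(\lambda)$, $h''(\hat\lambda) \ne 0$, and $nh(\hat\lambda) \to$ constant as
$n \to \infty$.

[Hint: Expand both $b(\cdot)$ and $h(\cdot)$ in Taylor series around $\hat\lambda$, up to second-order terms.
Then, do the term-by-term integration.]

This is the Laplace approximation for an integral. For refinements and other
developments of this approximation in Bayesian inference, see Tierney and Kadane
1986, Tierney, Kass, and Kadane 1989, and Robert 1994a (Section 9.2.3).

(b) For the hierarchical model (5.5.1), the posterior mean can be approximated by

$$E(\Theta|x) = e^{-nh(\hat\lambda)}\left[\frac{2\pi}{nh''(\hat\lambda)}\right]^{1/2} E(\Theta|x, \hat\lambda) + O\left(\frac{1}{n^{3/2}}\right),$$

where $h = -\frac{1}{n}\log\pi(\lambda|x)$ and $\hat\lambda$ is the mode of $\pi(\lambda|x)$, the posterior distribution of
$\lambda$.

(c) If $\pi(\lambda|x)$ is the normal distribution with mean $\hat\lambda$ and variance
$\sigma^2 = [-(\partial^2/\partial\lambda^2)\log\pi(\lambda|x)|_{\lambda=\hat\lambda}]^{-1}$, then $E(\Theta|x) = E(\Theta|x, \hat\lambda) + O(1/n^{3/2})$.

(d) Show that the situation in part (c) arises from the hierarchy

$$X_i|\theta_i \sim N(\theta_i, \sigma^2),$$
$$\theta_i|\lambda \sim N(\lambda, \tau^2),$$
$$\lambda \sim \mathrm{Uniform}(-\infty, \infty).$$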
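
The leading term of the Laplace approximation in part (a) can be tried numerically. The Python sketch below is only an illustration: the test integrand $\lambda^{n}e^{-n\lambda}$ on $(0, \infty)$, that is, $b \equiv 1$ and $h(\lambda) = \lambda - \log\lambda$ with $\hat\lambda = 1$, is an assumption chosen because the exact value $\Gamma(n+1)/n^{n+1}$ is available (the comparison then reduces to Stirling's formula). The function name `laplace_approx` is hypothetical.

```python
import math


def laplace_approx(b, h, h2, lam_hat, n):
    """Leading term b(lam_hat) * sqrt(2*pi / (n*h''(lam_hat))) * exp(-n*h(lam_hat))."""
    return (b(lam_hat)
            * math.sqrt(2.0 * math.pi / (n * h2(lam_hat)))
            * math.exp(-n * h(lam_hat)))


def exact_integral(n):
    """Integral of lam^n * exp(-n*lam) over (0, inf), i.e. Gamma(n+1) / n^(n+1)."""
    return math.exp(math.lgamma(n + 1) - (n + 1) * math.log(n))


if __name__ == "__main__":
    n = 50
    approx = laplace_approx(
        b=lambda lam: 1.0,
        h=lambda lam: lam - math.log(lam),   # minimized at lam_hat = 1
        h2=lambda lam: 1.0 / lam ** 2,       # second derivative of h
        lam_hat=1.0,
        n=n,
    )
    print(exact_integral(n) / approx)        # ratio close to 1 for large n
```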


6.18 (a) Apply the Laplace approximation (5.6.33) to obtain an approximation to the
hierarchical Bayes estimator of Example 6.6.

(b) Compare the approximation from part (a) with the empirical Bayes estimator
(5.6.24). Which is a better approximation to the hierarchical Bayes estimator?

6.19 Apply the Laplace approximation (5.6.33) to the hierarchy of Example 6.7 and show
that the resulting approximation to the hierarchical Bayes estimator is given by (5.6.32).

6.20 (a) Verify (6.6.37), that under squared error loss

$$r(\pi, \delta) = r(\pi, \delta^{\pi}) + E(\delta - \delta^{\pi})^2.$$

(b) For $X \sim \mathrm{binomial}(p, n)$, $L(p, \delta) = (p - \delta)^2$, and the class of priors
$\{\pi : \pi = \mathrm{beta}(a, b),\ a > 0,\ b > 0\}$, determine whether $\hat p = x/n$ or
$\delta^{0} = (a_0 + x)/(a_0 + b_0 + n)$ is more robust, according to (6.6.37).

(c) Is there an estimator of the form $(c + x)/(c + d + n)$ that you would consider more
robust, in the sense of (6.6.37), than either estimator in part (b)?

[In part (b), for fixed $n$ and $(a_0, b_0)$, calculate the Bayes risk of $\hat p$ and $\delta^{0}$ for a number
of $(a, b)$ pairs.]
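
The calculation suggested in the bracketed note can be sketched in Python. The sketch computes the Bayes risk of any estimator of the form $\delta(x)$ under a beta$(a, b)$ prior by summing the posterior expected loss over the beta-binomial marginal, which is the decomposition of part (a). The function names and the particular values of $n$, $(a_0, b_0)$, and $(a, b)$ are hypothetical choices for illustration.

```python
import math


def beta_binomial_pmf(x, n, a, b):
    """Marginal P(X = x) when X | p ~ binomial(p, n) and p ~ beta(a, b)."""
    def log_beta(u, v):
        return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(a + x, b + n - x) - log_beta(a, b))


def bayes_risk(estimator, n, a, b):
    """Bayes risk under squared error: E[var(p | X)] + E[(estimator(X) - E(p | X))^2]."""
    risk = 0.0
    for x in range(n + 1):
        post_mean = (a + x) / (a + b + n)
        post_var = (a + x) * (b + n - x) / ((a + b + n) ** 2 * (a + b + n + 1))
        risk += beta_binomial_pmf(x, n, a, b) * (post_var + (estimator(x) - post_mean) ** 2)
    return risk


if __name__ == "__main__":
    n, a0, b0 = 10, 2.0, 3.0
    for a, b in [(2.0, 3.0), (1.0, 1.0), (5.0, 1.0)]:
        r_mle = bayes_risk(lambda x: x / n, n, a, b)
        r_d0 = bayes_risk(lambda x: (a0 + x) / (a0 + b0 + n), n, a, b)
        print((a, b), r_mle, r_d0)
```

When $(a, b) = (a_0, b_0)$ the second risk is the Bayes risk of the Bayes estimator, which gives a convenient check against the closed forms of Problem 6.5 with $k = 1$.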

6.21 (a) Establish (6.6.39) and (6.6.40) for the class of priors given by (6.6.38).

(b) Show that the Bayes estimator based on a prior $\pi(\theta)$ in the class (6.6.38), under squared error
loss, is given by (6.6.41).


Section 7 

7.1 For the situation of Example 7.1:

(a) The empirical Bayes estimator of $\theta$, using an unbiased estimate of $\tau^2/(\sigma^2 + \tau^2)$, is

$$\left(1 - \frac{(p - 2)\sigma^2}{|x|^2}\right)x,$$

the James-Stein estimator.

(b) The empirical Bayes estimator of $\theta$, using the marginal MLE of $\tau^2/(\sigma^2 + \tau^2)$, is

$$\left(1 - \frac{p\sigma^2}{|x|^2}\right)^{+}x,$$

which resembles the positive-part Stein estimator.
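
The James-Stein estimator of part (a) and its positive-part version can be compared by simulation. The Python sketch below is a minimal illustration, not part of the problem: it assumes $\sigma^2 = 1$, $p = 10$, and $\theta = 0$, and the function names `james_stein` and `simulated_risk` are hypothetical. At $\theta = 0$ the exact risk of the James-Stein estimator is $p - (p - 2) = 2$, against a risk of $p = 10$ for $x$ itself.

```python
import random


def james_stein(x, sigma2=1.0, positive_part=False):
    """Shrink x toward 0 by the factor 1 - (p - 2) * sigma2 / |x|^2."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    if positive_part:
        shrink = max(shrink, 0.0)      # never reverse the sign of x
    return [shrink * v for v in x]


def simulated_risk(theta, estimator, n_rep=4000, seed=1):
    """Monte Carlo estimate of E|estimator(X) - theta|^2 for X ~ N_p(theta, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]
        d = estimator(x)
        total += sum((di - ti) ** 2 for di, ti in zip(d, theta))
    return total / n_rep


if __name__ == "__main__":
    theta = [0.0] * 10
    print(simulated_risk(theta, james_stein))
    print(simulated_risk(theta, lambda x: james_stein(x, positive_part=True)))
```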


7.2 Establish Corollary 7.2. Be sure to verify that the conditions on $g(x)$ are sufficient to
allow the integration-by-parts argument. [Stein (1973, 1981) develops these representations
in the normal case.]

7.3 The derivation of an unbiased estimator of the risk (Corollary 7.2) can be extended
to a more general model in the exponential family, the model of Corollary 3.3, where
$X = (X_1, \ldots, X_p)$ has the density

$$p_\eta(x) = e^{\sum_{i=1}^{p}\eta_i x_i - A(\eta)}\, h(x).$$

(a) The Bayes estimator of $\eta$, under squared error loss, is

$$E(\eta_i|x) = \frac{\partial}{\partial x_i}\log m(x) - \frac{\partial}{\partial x_i}\log h(x).$$

Show that the risk of $E(\eta|X)$ has unbiased estimator

$$\sum_{i=1}^{p}\left[\frac{\partial^2}{\partial x_i^2}\bigl(2\log m(x) - \log h(x)\bigr) + \left(\frac{\partial}{\partial x_i}\log m(x)\right)^2\right].$$

[Hint: Theorem 3.5 and Problem 3.4.]

(b) Show that the risk of the empirical Bayes estimator

$$E(\eta_i|x, \hat\lambda) = \frac{\partial}{\partial x_i}\log m(x|\hat\lambda(x)) - \frac{\partial}{\partial x_i}\log h(x)$$

of Theorem 6.3 has unbiased estimator

$$\sum_{i=1}^{p}\left[\frac{\partial^2}{\partial x_i^2}\bigl(2\log m(x|\hat\lambda(x)) - \log h(x)\bigr) + \left(\frac{\partial}{\partial x_i}\log m(x|\hat\lambda(x))\right)^2\right].$$

(c) Use the results of part (b) to derive an unbiased estimator of the risk of the
positive-part Stein estimator of (7.7.10).

7.4 Verify (7.7.9), the expression for the Bayes risk of $\delta^{\tau_0}$. (Problem 3.12 may be helpful.)

7.5 A general version of the empirical Bayes estimator (7.7.3) is given by

$$\delta^{c}(x) = \left(1 - \frac{c\sigma^2}{|x|^2}\right)x,$$

where $c$ is a positive constant.

(a) Use Corollary 7.2 to verify that

$$E_\theta|\theta - \delta^{c}(X)|^2 = p\sigma^2 + c\sigma^4[c - 2(p - 2)]\, E_\theta\frac{1}{|X|^2}.$$

(b) Show that the Bayes risk, under $\Theta \sim N_p(0, \tau^2 I)$, is given by

$$r(\pi, \delta^{c}) = \sigma^2\left[p + \frac{c\sigma^2}{\sigma^2 + \tau^2}\left(\frac{c}{p - 2} - 2\right)\right]$$

and is minimized by choosing $c = p - 2$.


7.6 For the model

$$X|\theta \sim N_p(\theta, \sigma^2 I),$$
$$\theta|\tau^2 \sim N_p(\mu, \tau^2 I),$$

show that:

(a) The empirical Bayes estimator, using an unbiased estimator of $\tau^2/(\sigma^2 + \tau^2)$, is the
Stein estimator

$$\delta_i^{JS}(x) = \mu_i + \left(1 - \frac{(p - 2)\sigma^2}{\sum_j (x_j - \mu_j)^2}\right)(x_i - \mu_i).$$

(b) If $p \ge 3$, the Bayes risk, under squared error loss, of $\delta^{JS}$ is $r(\tau, \delta^{JS}) = r(\tau, \delta^{\tau}) +
2\sigma^4/(\sigma^2 + \tau^2)$, where $r(\tau, \delta^{\tau})$ is the Bayes risk of the Bayes estimator.

(c) If $p < 3$, the Bayes risk of $\delta^{JS}$ is infinite. [Hint: Show that if $Y \sim \chi^2_m$, $E(1/Y) <
\infty \iff m \ge 3$.]


7.7 For the model

$$X|\theta \sim N_p(\theta, \sigma^2 I),$$
$$\theta|\tau^2 \sim N_p(\mu, \tau^2 I),$$

show that the Bayes risk of the ordinary Stein estimator

$$\delta_i(x) = \mu_i + \left(1 - \frac{(p - 2)\sigma^2}{\sum_j (x_j - \mu_j)^2}\right)(x_i - \mu_i)$$

is uniformly larger than that of its positive-part version

$$\delta_i^{+}(x) = \mu_i + \left(1 - \frac{(p - 2)\sigma^2}{\sum_j (x_j - \mu_j)^2}\right)^{+}(x_i - \mu_i).$$


7.8 Theorem 7.5 holds in greater generality than just the normal distribution. Suppose
$X$ is distributed according to the multivariate version of the exponential family $p_\eta(x)$ of
(3.3.7),

$$p_\eta(x) = e^{\eta' x - A(\eta)}\, h(x), \quad -\infty < x_i < \infty,$$

and a multivariate conjugate prior distribution [generalizing (3.19)] is used.

(a) Show that $E(X|\eta) = \nabla A(\eta)$.

(b) If $\mu = 0$ in the prior distribution (see 3.19), show that $r(\tau, \delta) \ge r(\tau, \delta^{+})$, where
$\delta(x) = [1 - B(x)]x$ and $\delta^{+}(x) = [1 - B(x)]^{+}x$.

(c) If $\mu \ne 0$, the estimator $\delta(x)$ would be modified to $\mu + \delta(x - \mu)$. Establish a result
similar to part (b) for this estimator.

[Hint: For part (b), the proof of Theorem 7.5, modified to use the Bayes estimator
$E(\nabla A(\eta)|x, k, \mu)$ as in (3.21), will work.]

7.9 (a) For the model (7.7.15), show that the marginal distribution of $X_i$ is negative
binomial$(a, 1/(b + 1))$; that is,

$$P(X_i = x) = \binom{a + x - 1}{x}\left(\frac{b}{b + 1}\right)^{x}\left(\frac{1}{b + 1}\right)^{a},$$

with $EX_i = ab$ and $\mathrm{var}\, X_i = ab(b + 1)$.

(b) If $X_1, \ldots, X_m$ are iid according to the negative binomial distribution in part (a),
show that the conditional distribution of $X_j\,|\sum_{i=1}^{m} X_i$ is the negative hypergeometric
distribution, given by

$$P\left(X_j = x\,\Big|\sum_{i=1}^{m} X_i = t\right) = \frac{\binom{a + x - 1}{x}\binom{(m - 1)a + t - x - 1}{t - x}}{\binom{ma + t - 1}{t}},$$

with $EX_j = t/m$ and $\mathrm{var}\, X_j = (m - 1)t(ma + t)/[m^2(ma + 1)]$.
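
The two mass functions in this problem can be checked numerically before attempting the proofs. The Python sketch below evaluates both, using the gamma function to allow a noninteger $a$, and compares their moments with the stated values. The function names and the particular values of $a$, $b$, $m$, and $t$ are hypothetical.

```python
import math


def log_nb_coef(r, k):
    """log of the generalized binomial coefficient C(r + k - 1, k)."""
    return math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)


def neg_binomial_pmf(x, a, b):
    """P(X = x) = C(a+x-1, x) * (b/(b+1))^x * (1/(b+1))^a."""
    return math.exp(log_nb_coef(a, x) + x * math.log(b / (b + 1.0)) - a * math.log(b + 1.0))


def neg_hypergeometric_pmf(x, a, m, t):
    """P(X_j = x | sum of m iid negative binomial counts = t)."""
    return math.exp(log_nb_coef(a, x) + log_nb_coef((m - 1) * a, t - x) - log_nb_coef(m * a, t))


def moments(pmf_values):
    """Mean and variance of a distribution on 0, 1, 2, ... given as a list."""
    mean = sum(x * p for x, p in enumerate(pmf_values))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf_values))
    return mean, var


if __name__ == "__main__":
    a, b, m, t = 2.5, 1.5, 4, 12
    print(moments([neg_binomial_pmf(x, a, b) for x in range(400)]))       # near (ab, ab(b+1))
    print(moments([neg_hypergeometric_pmf(x, a, m, t) for x in range(t + 1)]))
```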

7.10 For the situation of Example 7.6:

(a) Show that the Bayes estimator under the loss $L_k(\lambda, \delta)$ of (7.7.16) is given by
(7.7.17).

(b) Verify (7.7.19) and (7.7.20).

(c) Evaluate the Bayes risks $r(0, \delta^{*})$ and $r(1, \delta^{0})$. Which estimator, $\delta^{0}$ or $\delta^{*}$, is more
robust?


7.11 For the situation of Example 7.6, evaluate the Bayes risk of the empirical Bayes
estimator (7.7.20) for $k = 0$ and 1. What values of the unknown hyperparameter $b$ are
least and which are most favorable to the empirical Bayes estimator?

[Hint: Using the posterior expected loss (7.7.22) and Problem 7.9(b), the Bayes risk can
be expressed as an expectation of a function of $\sum X_i$ only. Further simplification seems
unlikely.]

7.12 Consider a hierarchical Bayes estimator for the Poisson model (7.7.15) with loss
(7.7.16). Using the distribution (5.6.27) for the hyperparameter $b$, show that the Bayes
estimator is

$$\frac{p\bar x + \alpha - k}{p\bar x + pa + \alpha + \beta - k}\,(a + x_i - k).$$

[Hint: Show that the Bayes estimator is $E(\lambda_i^{1-k}|x)/E(\lambda_i^{-k}|x)$ and that

$$E(\lambda_i^{r}|x) = \frac{\Gamma(p\bar x + pa + \alpha + \beta)\,\Gamma(p\bar x + \alpha + r)\,\Gamma(a + x_i + r)}{\Gamma(p\bar x + \alpha)\,\Gamma(p\bar x + pa + \alpha + \beta + r)\,\Gamma(a + x_i)}.]$$
7.13 Prove the following two matrix results, which are useful in calculating estimators
from multivariate hierarchical models:

(a) For any vector $a$ of the form $a = (I - \frac{1}{s}J)b$, $1'a = \sum a_i = 0$.

(b) If $B$ is an idempotent matrix (that is, $B^2 = B$) and $a$ is a scalar, then

$$(I + aB)^{-1} = I - \frac{a}{1 + a}B.$$
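
Both identities are easy to check numerically before proving them. The Python sketch below does so for one case, under assumptions that are not stated in the problem: $J$ is taken to be the $s \times s$ matrix of ones, $B = \frac{1}{s}J$ is used as the idempotent matrix, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def affine(I, B, c):
    """Return I + c * B elementwise."""
    n = len(I)
    return [[I[i][j] + c * B[i][j] for j in range(n)] for i in range(n)]


def max_abs_diff(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(len(A)) for j in range(len(A)))


if __name__ == "__main__":
    s, a = 4, 0.7
    I = identity(s)
    B = [[1.0 / s] * s for _ in range(s)]          # (1/s) J, an idempotent matrix
    # (b): (I + aB)(I - a/(1+a) B) should equal I
    product = matmul(affine(I, B, a), affine(I, B, -a / (1.0 + a)))
    print(max_abs_diff(product, I))
    # (a): the entries of (I - (1/s) J) b sum to zero
    b = [3.0, -1.0, 4.0, 1.5]
    centered = [bi - sum(b) / s for bi in b]
    print(sum(centered))
```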

7.14 For the situation of Example 7.7:

(a) Show how to derive the empirical Bayes estimator $\delta^{L}$ of (7.7.28).

(b) Verify the Bayes risk of $\delta^{L}$ of (7.7.29).

For the situation of Example 7.8:

(c) Show how to derive the empirical Bayes estimator $\delta^{EB2}$ of (7.7.33).

(d) Verify the Bayes risk of $\delta^{EB2}$, (7.7.34).

7.15 The empirical Bayes estimator (7.7.27) can also be derived as a hierarchical Bayes
estimator. Consider the hierarchical model

$$X_{ij}|\xi_i \sim N(\xi_i, \sigma^2), \quad j = 1, \ldots, n, \quad i = 1, \ldots, s,$$
$$\xi_i|\mu \sim N(\mu, \tau^2), \quad i = 1, \ldots, s,$$
$$\mu \sim \mathrm{Uniform}(-\infty, \infty),$$

where $\sigma^2$ and $\tau^2$ are known.

(a) Show that the Bayes estimator, with respect to squared error loss, is

$$E(\xi_i|x) = \frac{\sigma^2}{\sigma^2 + n\tau^2}E(\mu|x) + \frac{n\tau^2}{\sigma^2 + n\tau^2}\bar x_i,$$

where $E(\mu|x)$ is the posterior mean of $\mu$.

(b) Establish that $E(\mu|x) = \bar x = \sum x_{ij}/ns$. [This can be done by evaluating the
expectation directly, or by showing that the posterior distribution of $\xi_i|x$ is

$$\xi_i|x \sim N\left(\frac{\sigma^2}{\sigma^2 + n\tau^2}\bar x + \frac{n\tau^2}{\sigma^2 + n\tau^2}\bar x_i,\ \frac{\sigma^2}{\sigma^2 + n\tau^2}\left(\tau^2 + \frac{\sigma^2}{ns}\right)\right).$$

Note that the $\xi_i$'s are not independent a posteriori. In fact,

$$\xi|x \sim N_s\left(\frac{n\tau^2}{\sigma^2 + n\tau^2}M\bar x,\ \frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2}M\right),$$

where $M = I + \frac{\sigma^2}{n\tau^2}\,\frac{1}{s}J$.]

(c) Show that the empirical Bayes estimator (7.7.32) can also be derived as a hierarchical
Bayes estimator, by appending the specification $(\alpha, \beta) \sim \mathrm{Uniform}(\mathbb{R}^2)$ [that
is, $\pi(\alpha, \beta) = d\alpha\, d\beta$, $-\infty < \alpha, \beta < \infty$] to the hierarchy (7.7.30).

7.16 Generalization of model (7.7.23) to the case of unequal $n_i$ is, perhaps, not as
straightforward as one might expect. Consider the generalization

$$X_{ij}|\xi_i \sim N(\xi_i, \sigma^2), \quad j = 1, \ldots, n_i, \quad i = 1, \ldots, s,$$
$$\xi_i|\mu \sim N(\mu, \tau_i^2), \quad i = 1, \ldots, s.$$

We also make the assumption that $\tau_i^2 = \tau^2/n_i$. Show that:

(a) The above model is equivalent to

$$Y|\lambda \sim N_s(\lambda, \sigma^2 I),$$
$$\lambda \sim N_s(z\mu, \tau^2 I),$$

where $Y_i = \sqrt{n_i}\,\bar X_i$, $\lambda_i = \sqrt{n_i}\,\xi_i$, and $z = (\sqrt{n_1}, \ldots, \sqrt{n_s})'$.

(b) The Bayes estimator of $\xi_i$, using squared error loss, is

$$\frac{\sigma^2}{\sigma^2 + \tau^2}\mu + \frac{\tau^2}{\sigma^2 + \tau^2}\bar x_i.$$

(c) The marginal distribution of $Y$ is $Y \sim N_s(z\mu, (\sigma^2 + \tau^2)I)$, and an empirical Bayes
estimator of $\xi_i$ is

$$\delta_i^{EB} = \bar x + \left(1 - \frac{(s - 3)\sigma^2}{\sum_i n_i(\bar x_i - \bar x)^2}\right)(\bar x_i - \bar x).$$

[Without the assumption that $\tau_i^2 = \tau^2/n_i$, one cannot get a simple empirical Bayes
estimator. If $\tau_i^2 = \tau^2$, likelihood estimation can be used to get an estimate of $\tau^2$ to
be used in the empirical Bayes estimator. This is discussed by Morris (1983a).]

7.17 (Empirical Bayes estimation in a general case.) A general version of the hierarchical
models of Examples 7.7 and 7.8 is

$$X|\xi \sim N_s(\xi, \sigma^2 I),$$
$$\xi|\beta \sim N_s(Z\beta, \tau^2 I),$$

where $\sigma^2$ and $Z_{s \times r}$, of rank $r$, are known and $\tau^2$ and $\beta_{r \times 1}$ are unknown. Under this
model show that:

(a) The Bayes estimator of $\xi$, under squared error loss, is

$$E(\xi|x, \beta) = \frac{\sigma^2}{\sigma^2 + \tau^2}Z\beta + \frac{\tau^2}{\sigma^2 + \tau^2}x.$$

(b) Marginally, the distribution of $X|\beta$ is $X|\beta \sim N_s(Z\beta, (\sigma^2 + \tau^2)I)$.

(c) Under the marginal distribution in part (b),

$$E[(Z'Z)^{-1}Z'X] = E\hat\beta = \beta, \qquad E\left[\frac{s - r - 2}{|X - Z\hat\beta|^2}\right] = \frac{1}{\sigma^2 + \tau^2},$$

and, hence, an empirical Bayes estimator of $\xi$ is

$$\delta^{EB} = Z\hat\beta + \left(1 - \frac{(s - r - 2)\sigma^2}{|x - Z\hat\beta|^2}\right)(x - Z\hat\beta).$$

(d) The Bayes risk of $\delta^{EB}$ is $r(\tau, \delta^{\tau}) + (r + 2)\sigma^4/(\sigma^2 + \tau^2)$, where $r(\tau, \delta^{\tau})$ is the risk
of the Bayes estimator.

7.18 (Hierarchical Bayes estimation in a general case.) In a manner similar to the previous
problem, we can derive hierarchical Bayes estimators for the model

$$X|\xi \sim N_s(\xi, \sigma^2 I),$$
$$\xi|\beta \sim N_s(Z\beta, \tau^2 I),$$
$$\beta \sim \mathrm{Uniform}(\mathbb{R}^r),$$

where $\sigma^2$ and $Z_{s \times r}$, of rank $r$, are known and $\tau^2$ is unknown.

(a) The prior distribution of $\xi$, unconditional on $\beta$, is proportional to

$$\pi(\xi) \propto e^{-\frac{1}{2}\,\xi'(I - H)\xi/\tau^2},$$

where $H = Z(Z'Z)^{-1}Z'$ projects from $\mathbb{R}^s$ to $\mathbb{R}^r$.

[Hint: Establish that

$$(\xi - Z\beta)'(\xi - Z\beta) = \xi'(I - H)\xi + [\beta - (Z'Z)^{-1}Z'\xi]'\,Z'Z\,[\beta - (Z'Z)^{-1}Z'\xi]$$

to perform the integration on $\beta$.]

(b) Show that

$$\xi|x \sim N_s\left(\frac{\tau^2}{\sigma^2 + \tau^2}Mx,\ \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}M\right),$$

where $M = I + (\sigma^2/\tau^2)H$, and hence that the Bayes estimator is given by

$$\frac{\sigma^2}{\sigma^2 + \tau^2}Hx + \frac{\tau^2}{\sigma^2 + \tau^2}x,$$

where $Z\hat\beta = Hx$.

[Hint: Establish that

$$\frac{1}{\tau^2}\xi'(I - H)\xi + \frac{1}{\sigma^2}(x - \xi)'(x - \xi) = \frac{\sigma^2 + \tau^2}{\sigma^2\tau^2}\left(\xi - \frac{\tau^2}{\sigma^2 + \tau^2}Mx\right)'M^{-1}\left(\xi - \frac{\tau^2}{\sigma^2 + \tau^2}Mx\right) + \frac{1}{\sigma^2 + \tau^2}x'(I - H)x,$$

where $M^{-1} = I - \frac{\sigma^2}{\sigma^2 + \tau^2}H$.]

(c) Marginally, $X'(I - H)X \sim (\sigma^2 + \tau^2)\chi^2_{s-r}$. This leads us to the empirical Bayes
estimator

$$Hx + \left(1 - \frac{(s - r - 2)\sigma^2}{x'(I - H)x}\right)(x - Hx),$$

which is equal to the empirical Bayes estimator of Problem 7.17(c).

[The model in this and the previous problem can be substantially generalized. For example,
both $\sigma^2 I$ and $\tau^2 I$ can be replaced by full, positive definite matrices. At the cost of
an increase in the complexity of the matrix calculations and the loss of simple answers,
hierarchical and empirical Bayes estimators can be computed. The covariances, either
scalar or matrix, can also be unknown, and inverted gamma (or inverted Wishart) prior
distributions can be accommodated. Calculations can be implemented via the Gibbs
sampler.

Note that these generalizations encompass the “unequal n” case (see Problem 7.16),
but there are no simple solutions for this case. Many of these estimators also possess a
minimax property, which will be discussed in Chapter 5.]

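7.19 As noted by Morris (1983a), an analysis of variance-type hierarchical model, with
unequal $n_i$, will yield closed-form empirical Bayes estimators if the prior variances are
proportional to the sampling variances. Show that, for the model

$$X_{ij}|\xi_i \sim N(\xi_i, \sigma^2), \quad j = 1, \ldots, n_i, \quad i = 1, \ldots, s,$$
$$\xi|\beta \sim N_s(Z\beta, \tau^2 D^{-1}),$$

where $\sigma^2$ and $Z_{s \times r}$, of full rank $r$, are known, $\tau^2$ is unknown, and $D = \mathrm{diag}(n_1, \ldots, n_s)$,
an empirical Bayes estimator is given by

$$\delta^{EB} = Z\hat\beta + \left(1 - \frac{(s - r - 2)\sigma^2}{(\bar x - Z\hat\beta)'D(\bar x - Z\hat\beta)}\right)(\bar x - Z\hat\beta),$$

with $\bar x_i = \sum_j x_{ij}/n_i$, $\bar x = \{\bar x_i\}$, and $\hat\beta = (Z'DZ)^{-1}Z'D\bar x$.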

7.20 An entertaining (and unjustifiable) result which abuses a hierarchical Bayes calculation
yields the following derivation of the James-Stein estimator. Let $X \sim N_p(\theta, I)$
and $\theta|\tau^2 \sim N_p(0, \tau^2 I)$.

(a) Verify that conditional on $\tau^2$, the posterior and marginal distributions are given by

$$\pi(\theta|x, \tau^2) = N_p\left(\frac{\tau^2}{\tau^2 + 1}x,\ \frac{\tau^2}{\tau^2 + 1}I\right),$$
$$m(x|\tau^2) = N_p[0, (\tau^2 + 1)I].$$

(b) Show that, taking $\pi(\tau^2) = 1$, $-1 < \tau^2 < \infty$, we have

$$\int\!\!\int_{\mathbb{R}^p}\theta\,\pi(\theta|x, \tau^2)\, m(x|\tau^2)\, d\theta\, d\tau^2 = \frac{x}{(2\pi)^{p/2}(|x|^2)^{p/2-1}}\left[\Gamma\left(\frac{p - 2}{2}\right)2^{(p-2)/2} - \Gamma\left(\frac{p}{2}\right)\frac{2^{p/2}}{|x|^2}\right]$$

and

$$\int\!\!\int_{\mathbb{R}^p}\pi(\theta|x, \tau^2)\, m(x|\tau^2)\, d\theta\, d\tau^2 = \frac{1}{(2\pi)^{p/2}(|x|^2)^{p/2-1}}\,\Gamma\left(\frac{p - 2}{2}\right)2^{(p-2)/2},$$

and hence

$$E(\theta|x) = \left(1 - \frac{p - 2}{|x|^2}\right)x.$$

(c) Explain some implications of the result in part (b) and why it cannot be true. [Try
to reconcile it with (3.3.12).]

(d) Why are the calculations in part (b) unjustified?

9 Notes

9.1 History 

Following the basic paper by Bayes (published posthumously in 1763), Laplace initiated
a widespread use of Bayes procedures, particularly with noninformative priors (for
example, in his paper of 1774 and the fundamental book of 1820; see Stigler 1983,
1986). However, Laplace also employed non-Bayesian methods, without always making
a clear distinction. A systematic theory of statistical inference based on noninformative
(locally invariant) priors, generalizing and refining Laplace’s approach, was developed
by Jeffreys in his book on probability theory (1st edition 1939, 3rd edition 1961). A
corresponding subjective theory owes its modern impetus to the work of de Finetti (for
example, 1937, 1970) and that of L. J. Savage, particularly in his book on the Foundations
of Statistics (1954). The idea of selecting an appropriate prior from the conjugate family
was put forward by Raiffa and Schlaifer (1961). Interest in Bayes procedures (although
not from a Bayesian point of view) also received support from Wald’s result (for example,
1950) that all admissible procedures are either Bayes or limiting Bayes (see Section 5.8).
Bayesian attitudes and approaches are continually developing, with some of the most
influential work done by Good (1965), DeGroot (1970), Zellner (1971), de Finetti (1974),
Box and Tiao (1973), Berger (1985), and Bernardo and Smith (1994). An account of
criticisms of the Bayesian approach can be found in Rothenberg (1977) and Berger
(1985, Section 4.12). Robert (1994a, Chapter 10) provides a defense of “The Bayesian
Choice.”

9.2 Modeling 

A general Bayesian treatment of linear models is given by Lindley and Smith (1972); 
the linear mixed model is given a Bayesian treatment in Searle et al. (1992, Chapter 
9); sampling from a finite population is discussed from a Bayesian point of view by 
Ericson (1969) (see also Godambe 1982); a Bayesian approach to contingency tables 
is developed by Lindley (1964), Good (1965), and Bloch and Watson (1967) (see also 
Bishop, Fienberg, and Holland 1975 and Leonard 1972). The theory of Bayes estimation 
in exponential families is given a detailed development by Bernardo and Smith (1994). 
The fact that the resulting posterior expectations are convex combinations of sample and 
prior means is a characterization of this situation (Diaconis and Ylvisaker 1979, Goel 
and DeGroot 1980, MacEachern 1993). 

Extensions to nonlinear and generalized linear models are given by Eaves (1983) and 
Albert (1988). In particular, for the generalized linear model, Ibrahim and Laud (1991) 
and Natarajan and McCulloch (1995) examine conditions for the propriety of posterior 
densities resulting from improper priors. 

9.3 Computing 

One reason that interest in Bayesian methods has flourished is the great
strides made in Bayesian computing. The fundamental work of Geman and Geman (1984)
(which built on that of Metropolis et al. 1953 and Hastings 1970) influenced Gelfand
and Smith (1990) to write a paper that sparked new interest in Bayesian methods,
statistical computing, algorithms, and stochastic processes through the use of computing
algorithms such as the Gibbs sampler and the Metropolis-Hastings algorithm. Elementary
introductions to these topics can be found in Casella and George (1992) and Chib
and Greenberg (1995). More detailed and advanced treatments are given in Tierney
(1994), Robert (1994b), Gelman et al. (1995), and Tanner (1996).

9.4 The Ergodic Theorem

The general theorem about convergence of (5.15) in a Markov chain is known as the
Ergodic Theorem; the name was coined by Boltzmann when investigating the behavior
of gases (see Dudley 1989, p. 217). A sequence $X_0, X_1, X_2, \ldots$ is called ergodic if
the limit of $\sum_{i=1}^{n} X_i/n$ is independent of the initial value of $X_0$. The ergodic theorem
for stationary sequences, those for which $(X_{j_1}, \ldots, X_{j_k})$ has the same distribution as
$(X_{j_1+r}, \ldots, X_{j_k+r})$ for all $r = 1, 2, \ldots$, is an assertion of the equality of time and space
averages and holds in some generality (Dudley 1989, Section 8.4, Billingsley 1995,
Section 24).

As the importance of this theorem led it to have wider applicability, the term “ergodic” 
has come to be applied in many situations and is often associated with Markov chains. In 
statistical practice, the usefulness of Markov chains for computations and the importance 
of the limit being independent of the starting values has brought the study of the ergodic 
behavior of Markov chains into prominence for statisticians. Good entries to the classical 
theory of Markov chains can be found in Feller (1968), Kemeny and Snell (1976), 
Resnick (1992), Ross (1985), or the more advanced treatment by Meyn and Tweedie 
(1993). In the context of estimation, the papers by Tierney (1994) and Robert (1995) 
provide detailed introductions to the relevant Markov chain theory. Athreya, Doss, and 
Sethuraman (1996) rigorously develop limit theorems for Markov chains arising in Gibbs 
sampling-type situations. 

We are mainly concerned with Markov chains $X_0, X_1, \ldots$ that have an invariant
distribution, $F$, satisfying $\int_A dF(x) = \int P(X_{n+1} \in A|X_n = x)\, dF(x)$. The chain is called
irreducible if all sets with positive probability under the invariant distribution can be
reached at some point by the chain. Such an irreducible chain is also recurrent (Tierney
1994, Section 3.1). A recurrent chain is one that visits every set infinitely often
(i.o.) or, more importantly, a recurrent chain tends not to “drift off” to infinity.
Formally, an irreducible Markov chain is recurrent if for each $A$ with $\int_A dF(x) > 0$, we
have $P(X_k \in A \text{ i.o.}\,|X_0 = x_0) > 0$ for all $x_0$, and equal to 1 for almost all $x_0$ $(F)$. If
$P(X_k \in A \text{ i.o.}\,|X_0 = x_0) = 1$ for all $x_0$, the chain is called Harris recurrent. Finally, if
the invariant distribution $F$ has finite mass (as it will in most of the cases we consider
here), the chain is positive recurrent; otherwise it is null recurrent.

The Markov chain is periodic if for some integer $m \ge 2$, there exists a collection
of disjoint sets $\{A_1, \ldots, A_m\}$ for which $P(X_{k+1} \in A_{j+1}|X_k \in A_j) = 1$ for all $j =
1, \ldots, m - 1$ (mod $m$). That is, the chain periodically travels through the sets $A_1, \ldots, A_m$.
If no such collection of sets exists, the chain is aperiodic.

The relationships between these Markov chain properties and their consequences are
summarized in the following theorem, based on Theorem 1 of Tierney (1994).

Theorem 9.1 Suppose that the Markov chain $X_0, X_1, \ldots$ is irreducible with invariant
distribution $F$ satisfying $\int dF(x) = 1$. Then, the Markov chain is positive recurrent and
$F$ is the unique invariant distribution. If the Markov chain is also aperiodic, then for
almost all $x_0$ ($F$),

$\sup_A \left| P(X_k \in A \mid X_0 = x_0) - \int_A dF(x) \right| \to 0.$

If the chain is Harris recurrent, the convergence occurs for all $x_0$.
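
The convergence in Theorem 9.1 can be illustrated on a finite state space, where the supremum over sets $A$ is the total variation distance. The following Python sketch is only an illustration: the three-state transition matrix `P`, its invariant distribution `F`, and the function names are hypothetical choices, not taken from the text.

```python
def step(dist, P):
    """Advance a distribution one step: row vector times transition matrix."""
    size = len(P)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]


def tv_distance(p, q):
    """sup over sets A of |p(A) - q(A)| for distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical irreducible, aperiodic chain on the states {0, 1, 2}.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
F = [0.25, 0.50, 0.25]  # invariant distribution: F P = F


def tv_after(k, start):
    """Total variation distance between the law of X_k given X_0 = start and F."""
    dist = [1.0 if i == start else 0.0 for i in range(len(P))]
    for _ in range(k):
        dist = step(dist, P)
    return tv_distance(dist, F)
```

For this chain, `tv_after(k, 0)` shrinks geometrically in `k`, whatever the starting state.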


It is common to call a Markov chain ergodic if it is positive Harris recurrent and aperiodic. 
For such chains, we have the following version of the ergodic theorem. 
Theorem 9.2 Let $X_0, X_1, X_2, \ldots$ be an ergodic Markov chain with invariant distribution
$F$. Then, for any function $h$ with $\int |h(x)|\,dF(x) < \infty$,

$(1/n) \sum_{i=1}^{n} h(X_i) \to \int h(x)\,dF(x)$ almost everywhere ($F$).

9.5 Parametric and Nonparametric Empirical Bayes

The empirical Bayes analysis considered in Section 6 is sometimes referred to as parametric
empirical Bayes, to distinguish it from the empirical Bayes methodology developed
by Robbins (1955), which could be called nonparametric. In nonparametric
empirical Bayes analysis, no functional form is assumed for the prior distribution, but a
nonparametric estimator of the prior is built up and the resulting empirical Bayes mean
is calculated. Robbins showed that as the sample size goes to infinity, it is possible to
achieve the same Bayes risk as that achieved by the true Bayes estimator. Much research
has been done in this area (see, for example, Van Ryzin and Susarla 1977, Susarla 1982,
Robbins 1983, and Maritz and Lwin 1989). Due to the nature of this approach, its optimality
properties tend to occur in large samples, with the parametric empirical Bayes
approach being more suited for estimation in finite-sample problems.

Parametric empirical Bayes methods also have a long history, with major developments
evolving in the sequence of papers by Efron and Morris (1971, 1972a, 1972b, 1973a,
1973b, 1975, 1976a, 1976b), where the connection with minimax estimation is explored.
The theory and applications of empirical Bayes methods are given by Morris (1983a);
a more comprehensive treatment is found in Carlin and Louis (1996). Less technical
introductions are given by Casella (1985a, 1992a).

9.6 Robust Bayes

Robust Bayesian methods were effectively coalesced into a practical methodology by
Berger (1984). Since then, there has been a great deal of research on this topic. (See, for
example, Berger and Berliner 1986, Wasserman 1989, 1990, Sivaganesan and Berger
1989, DasGupta 1991, Lavine 1991a, 1991b, and the review papers by Berger 1990b,
1994 and Wasserman 1994.) The idea of using a class of priors is similar to the gamma-minimax
approach, first developed by Robbins (1951, 1964) and Good (1952). In this
approach, the subject of robustness over the class is usually not an issue, but rather the
objective is the construction of an estimator that is minimax over the class (see Problem
5.1.2).

CHAPTER 5


Minimaxity and Admissibility 


1 Minimax Estimation 


At the beginning of Chapter 4, we introduced two ways in which the risk function
$R(\theta, \delta)$ can be minimized in some overall sense: minimizing a weighted-average
risk and minimizing the maximum risk. The first of these approaches was the
concern of Chapter 4; in the present chapter, we shall consider the second.

Definition 1.1 An estimator $\delta^M$ of $\theta$, which minimizes the maximum risk, that is,
which satisfies

(1.1)    $\inf_{\delta} \sup_{\theta} R(\theta, \delta) = \sup_{\theta} R(\theta, \delta^M)$,

is called a minimax estimator.


The problem of finding the estimator $\delta^M$, which minimizes the maximum risk,
is often difficult. Thus, unlike what happened in UMVU, equivariant, and Bayes
estimation, we shall not be able to determine minimax estimators for large classes
of problems but, rather, will treat problems individually (see Section 5.4).


Example 1.2 A first example. As we will see (Example 2.17), the Bayes estimators
of Example 4.1.5, given by (4.1.12), that is,

(1.2)    $\delta_\Lambda(x) = \frac{a + x}{a + b + n}$,

are admissible. Their risk functions are, therefore, incomparable as they all must
cross (or coincide). As an illustration, consider the group of three estimators
$\delta^{\pi_i}$, $i = 1, \ldots, 3$, Bayes estimators from beta(1, 3), beta(2, 2) and beta(3, 1) priors,
respectively. Based on this construction, each $\delta^{\pi_i}$ will be preferred if it is
thought that the true value of the parameter is close to its prior mean (1/4, 1/2,
3/4, respectively). Alternatively, one might choose $\delta^{\pi_2}$ since it can be shown that
$\delta^{\pi_2}$ has the smallest maximum risk among the three estimators being considered
(see Problem 1.1). Although $\delta^{\pi_2}$ is minimax among these three estimators, it is
not minimax overall. See Problems 1.2 and 1.3 for an alternative definition of
minimaxity where the class of estimators is restricted. ||
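
The comparison of maximum risks in Example 1.2 can be checked numerically. The sketch below uses the squared-error risk of the estimator $(a + X)/(a + b + n)$, namely variance plus squared bias; the sample size `n = 10`, the grid over $p$, and all function names are assumptions made for illustration.

```python
def risk(p, n, a, b):
    """Squared-error risk at p of (a + X)/(a + b + n) with X ~ binomial(n, p)."""
    variance_part = n * p * (1 - p)
    bias_part = (a * (1 - p) - b * p) ** 2
    return (variance_part + bias_part) / (a + b + n) ** 2


def max_risk(n, a, b, grid=2001):
    """Maximum of the risk over an evenly spaced grid on [0, 1]."""
    return max(risk(i / (grid - 1), n, a, b) for i in range(grid))


n = 10  # assumed sample size
priors = {"beta(1,3)": (1, 3), "beta(2,2)": (2, 2), "beta(3,1)": (3, 1)}
max_risks = {name: max_risk(n, a, b) for name, (a, b) in priors.items()}
best = min(max_risks, key=max_risks.get)  # prior whose Bayes estimator has least max risk
```

With these choices, `best` is the beta(2, 2) prior, in line with the claim about $\delta^{\pi_2}$.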


As pointed out in Section 4.1(i), and suggested by Example 1.2, Bayes estimators
provide a tool for solving minimax problems. Thus, Bayesian considerations
are helpful when choosing an optimal frequentist estimator. Viewed in this light,
there is a synthesis of the two approaches. The Bayesian approach provides us
with a means of constructing an estimator that has optimal frequentist properties.
This synthesis highlights important features of both the Bayesian and frequentist
approaches. The Bayesian paradigm is well suited for the construction of possibly
optimal estimators, but is less well suited for their evaluation. The frequentist
paradigm is complementary, as it is well suited for risk evaluations, but less well 
suited for construction. It is important to view these two approaches and hence the 
contents of Chapters 4 and 5 as complementary rather than adversarial; together 
they provide a rich set of tools and techniques for the statistician. 

If we want to apply this idea to the determination of minimax estimators, we
must ask ourselves: For what prior distribution $\Lambda$ is the Bayes solution $\delta_\Lambda$ likely to
be minimax? A minimax procedure, by minimizing the maximum risk, tries to do
as well as possible in the worst case. One might, therefore, expect that the minimax
estimator would be Bayes for the worst possible distribution. To make this concept
precise, let us denote the average risk (Bayes risk) of the Bayes solution $\delta_\Lambda$ by

(1.3)    $r_\Lambda = r(\Lambda, \delta_\Lambda) = \int R(\theta, \delta_\Lambda)\,d\Lambda(\theta)$.

Definition 1.3 A prior distribution $\Lambda$ is least favorable if $r_\Lambda \ge r_{\Lambda'}$ for all prior
distributions $\Lambda'$.


This is the prior distribution which causes the statistician the greatest average 
loss. 

The following theorem provides a simple condition for a Bayes estimator $\delta_\Lambda$ to
be minimax.


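Theorem 1.4 Suppose that $\Lambda$ is a distribution on $\Theta$ such that

(1.4)    $r(\Lambda, \delta_\Lambda) = \int R(\theta, \delta_\Lambda)\,d\Lambda(\theta) = \sup_{\theta} R(\theta, \delta_\Lambda)$.

Then:

(i) $\delta_\Lambda$ is minimax.

(ii) If $\delta_\Lambda$ is the unique Bayes solution with respect to $\Lambda$, it is the unique minimax
procedure.

(iii) $\Lambda$ is least favorable.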

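Proof.

(i) Let $\delta$ be any other procedure. Then,

$\sup_{\theta} R(\theta, \delta) \ge \int R(\theta, \delta)\,d\Lambda(\theta) \ge \int R(\theta, \delta_\Lambda)\,d\Lambda(\theta) = \sup_{\theta} R(\theta, \delta_\Lambda)$.

(ii) This follows by replacing $\ge$ by $>$ in the second inequality of the proof of (i).

(iii) Let $\Lambda'$ be some other distribution of $\theta$. Then,

$r_{\Lambda'} = \int R(\theta, \delta_{\Lambda'})\,d\Lambda'(\theta) \le \int R(\theta, \delta_\Lambda)\,d\Lambda'(\theta) \le \sup_{\theta} R(\theta, \delta_\Lambda) = r_\Lambda$.

□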
Condition (1.4) states that the average of $R(\theta, \delta_\Lambda)$ is equal to its maximum. This
will be the case when the risk function is constant or, more generally, when $\Lambda$
assigns probability 1 to the set on which the risk function takes on its maximum
value. The following minimax characterizations are variations and simplifications
of this requirement.

Corollary 1.5 If a Bayes solution $\delta_\Lambda$ has constant risk, then it is minimax.

Proof. If $\delta_\Lambda$ has constant risk, (1.4) clearly holds. □

Corollary 1.6 Let $\omega_\Lambda$ be the set of parameter points at which the risk function of
$\delta_\Lambda$ takes on its maximum, that is,

(1.5)    $\omega_\Lambda = \{\theta : R(\theta, \delta_\Lambda) = \sup_{\theta'} R(\theta', \delta_\Lambda)\}$.

Then, $\delta_\Lambda$ is minimax if and only if

(1.6)    $\Lambda(\omega_\Lambda) = 1$.

This can be rephrased by saying that a sufficient condition for $\delta_\Lambda$ to be minimax
is that there exists a set $\omega$ such that

$\Lambda(\omega) = 1$

and

(1.7)    $R(\theta, \delta_\Lambda)$ attains its maximum at all points of $\omega$.


Example 1.7 Binomial. Suppose that $X$ has the binomial distribution $b(p, n)$ and
that we wish to estimate $p$ with squared error loss. To see whether $X/n$ is minimax,
note that its risk function $p(1 - p)/n$ has a unique maximum at $p = 1/2$. To apply
Corollary 1.6, we need to use a prior distribution $\Lambda$ for $p$ which assigns probability
1 to $p = 1/2$. The corresponding Bayes estimator is $\delta(X) \equiv 1/2$, not $X/n$. Thus,
if $X/n$ is minimax, the approach suggested by Corollary 1.6 does not work in the
present case. It is, in fact, easy to see that $X/n$ is not minimax (Problem 1.9).

To determine a minimax estimator by the method of Theorem 1.4, let us utilize
the result of Example 4.1.5 and try a beta distribution for $\Lambda$. If $\Lambda$ is $B(a, b)$, the
Bayes estimator is given by (4.1.12) and its risk function is

(1.8)    $\frac{1}{(a + b + n)^2}\{np(1 - p) + [a(1 - p) - bp]^2\}$.

Corollary 1.5 suggests seeing whether there exist values $a$ and $b$ for which the risk
function (1.8) is constant. Setting the coefficients of $p^2$ and $p$ in (1.8) equal to zero
shows that (1.8) is constant if and only if

(1.9)    $(a + b)^2 = n$ and $2a(a + b) = n$.

Since $a$ and $b$ are positive, $a + b = \sqrt{n}$ and, hence,

(1.10)    $a = b = \frac{1}{2}\sqrt{n}$.
It follows that the estimator

(1.11)    $\delta = \frac{X + \frac{1}{2}\sqrt{n}}{n + \sqrt{n}} = \frac{X}{n}\,\frac{\sqrt{n}}{1 + \sqrt{n}} + \frac{1}{2}\,\frac{1}{1 + \sqrt{n}}$

is constant risk Bayes and, hence, minimax. Because of the uniqueness of the
Bayes estimator (4.1.4), it is seen that (1.11) is the unique minimax estimator of
$p$.

Of course, the estimator (1.11) is biased (Problem 1.10) because $X/n$ is the only
unbiased estimator that is a function of $X$. A comparison of its risk, which is

(1.12)    $r_n = E(\delta - p)^2 = \frac{1}{4}\,\frac{1}{(1 + \sqrt{n})^2}$,

with the risk function

(1.13)    $R_n(p) = p(1 - p)/n$

of $X/n$ shows that (Problem 1.11) $r_n < R_n(p)$ in an interval $I_n = (1/2 - c_n <
p < 1/2 + c_n)$ and $r_n > R_n(p)$ outside $I_n$. For small values of $n$, $c_n$ is close to 1/2,
so that the minimax estimator is better (and, in fact, substantially better) for most
of the range of $p$. However, as $n \to \infty$, $c_n \to 0$ and $I_n$ shrinks toward the point
1/2. Furthermore, $\sup_p R_n(p)/r_n = R_n(1/2)/r_n \to 1$, so that even at $p = 1/2$,
where the comparison is least favorable to $X/n$, the improvement achieved by the
minimax estimator is negligible. Thus, for large and even moderate $n$, $X/n$ is the
better of the two estimators. In the limit as $n \to \infty$ (although not for any finite $n$),
$X/n$ dominates the minimax estimator. Problems for which such a subminimax
sequence does not exist are discussed by Ghosh (1964).
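
The constant risk (1.12) and the shrinking interval $I_n$ can be checked directly. In the sketch below, the risk of (1.11) is computed exactly by summing over the binomial probabilities, and the half-width $c_n$ is obtained by solving $p(1 - p)/n = r_n$, which gives $c_n = \sqrt{1/4 - n r_n}$; the function names are illustrative assumptions.

```python
import math


def minimax_estimate(x, n):
    """Estimator (1.11): (x + sqrt(n)/2) / (n + sqrt(n))."""
    return (x + math.sqrt(n) / 2) / (n + math.sqrt(n))


def exact_risk(estimator, n, p):
    """Exact squared-error risk, summing over the binomial(n, p) pmf."""
    return sum(
        math.comb(n, x) * p**x * (1 - p) ** (n - x) * (estimator(x, n) - p) ** 2
        for x in range(n + 1)
    )


def constant_risk(n):
    """The constant risk r_n of (1.12)."""
    return 1 / (4 * (1 + math.sqrt(n)) ** 2)


def half_width(n):
    """c_n such that p(1-p)/n > r_n exactly when |p - 1/2| < c_n."""
    return math.sqrt(0.25 - n * constant_risk(n))
```

Evaluating `half_width` for increasing `n` shows how quickly the region in which the minimax estimator beats $X/n$ contracts toward $p = 1/2$.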

The present example illustrates an asymmetry between parts (ii) and (iii) of
Theorem 1.4. Part (ii) asserts the uniqueness of the minimax estimator, whereas
no such claim is made in part (iii) for the least favorable $\Lambda$. In the present case, it
follows from (4.1.4) that for any $\Lambda$, the Bayes estimator of $p$ is

(1.14)    $\delta_\Lambda(x) = \frac{\int_0^1 p^{x+1}(1 - p)^{n-x}\,d\Lambda(p)}{\int_0^1 p^{x}(1 - p)^{n-x}\,d\Lambda(p)}$.

Expansion of $(1 - p)^{n-x}$ in powers of $p$ shows that $\delta_\Lambda(x)$ depends on $\Lambda$ only
through the first $n + 1$ moments of $\Lambda$. This shows, in particular, that the least
favorable distribution is not unique in the present case. Any prior distribution with
the same first $n + 1$ moments gives the same Bayes solution and, hence, by Theorem
1.4 is least favorable (Problem 1.13).

Viewed as a loss function, squared error may be unrealistic when estimating $p$
since in many situations an error of fixed size seems much more serious for values
of $p$ close to 0 or 1 than for values near 1/2. To take account of this difficulty, let

(1.15)    $L(p, d) = \frac{(d - p)^2}{p(1 - p)}$.

With this loss function, $X/n$ becomes a constant risk estimator and is seen to be
a Bayes estimator with respect to the uniform distribution on (0, 1) and hence a
minimax estimator. It is interesting to note that with (1.15), the risk function of the
estimator (1.11) is unbounded. This indicates how strongly the minimax property
can depend on the loss function. ||

When the loss function is convex in d, as was the case in Example 1.7, it follows 
from Corollary 1.7.9 that attention may be restricted to nonrandomized estimators. 
The next example shows that this is no longer true when the convexity assumption 
is dropped. 

Example 1.8 Randomized minimax estimator. In the preceding example, suppose
that the loss is zero when $|d - p| \le a$ and is one otherwise, where $a <
1/[2(n + 1)]$. Since any nonrandomized $\delta(X)$ can take on at most $n + 1$ distinct values,
the maximum risk of any such $\delta$ is then equal to 1. To exhibit a randomized
estimator with a smaller maximum risk, consider the extreme case in which the
estimator of $p$ does not depend on the data at all but is a random variable $U$, which
is uniformly distributed on (0, 1). The resulting risk function is

(1.16)    $R(p, U) = 1 - P(|U - p| \le a)$

and it is easily seen that the maximum of (1.16) is $1 - a < 1$ (Problem 1.14). ||
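
The risk (1.16) is simple enough to evaluate in closed form, since $P(|U - p| \le a)$ is the length of the part of $[p - a, p + a]$ lying inside (0, 1). The following sketch evaluates it on a grid; the values `n = 4` and `a = 0.05` are assumptions chosen so that $a < 1/[2(n + 1)]$.

```python
def risk_uniform_guess(p, a):
    """R(p, U) = 1 - P(|U - p| <= a) for U uniform on (0, 1)."""
    lower = max(0.0, p - a)
    upper = min(1.0, p + a)
    return 1.0 - (upper - lower)


n = 4      # assumed number of trials
a = 0.05   # assumed tolerance, below 1 / (2 * (n + 1)) = 0.1
grid = [i / 1000 for i in range(1001)]
max_risk = max(risk_uniform_guess(p, a) for p in grid)
```

The maximum is attained at the endpoints $p = 0$ and $p = 1$, where only half of the tolerance interval falls inside (0, 1).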

The loss function in this example was chosen to make the calculations easy, but
the possibility of reducing the maximum risk through randomization exists also
for other nonconvex loss functions. In particular, for the problem of Example 1.7
with loss function $|d - p|^r$ ($0 < r < 1$), it can be proved that no nonrandomized
estimator can be minimax (Hodges and Lehmann 1950).

Example 1.9 Difference of two binomials. Consider the case of two independent
variables $X$ and $Y$ with distributions $b(p_1, m)$ and $b(p_2, n)$, respectively, and the
problem of estimating $p_2 - p_1$ with squared error loss. We shall now obtain the
minimax estimator when $m = n$; no solution is known when $m \ne n$.

The derivation of the estimator in Example 4.1.5 suggests that in the present 
case, too, the minimax estimator might be a linear estimator aX + bY + k with 
constant risk. However, it is easy to see (Problem 1.18) that such a minimax 
estimator does not exist. Still hoping for a linear estimator, we shall therefore try 
to apply Corollary 1.6. Before doing so, let us simplify the hoped-for solutions by 
an invariance consideration. 

The problem remains invariant under the transformation

(1.17)    $(X', Y') = (Y, X)$, $(p_1', p_2') = (p_2, p_1)$, $d' = -d$,

and an estimator $\delta(X, Y)$ is equivariant under this transformation provided $\delta(Y, X) =
-\delta(X, Y)$ and hence if

$(a + b)(x + y) + 2k = 0$ for all $x, y$.

This leads to the condition $a + b = k = 0$ and, therefore, to an estimator of the form

(1.18)    $\delta(X, Y) = c(Y - X)$.

As will be seen in Section 5.4 (see Theorem 4.1 and the discussion following it),
if a problem remains invariant under a finite group $G$ and if a minimax estimator
exists, then there exists an equivariant minimax estimator. In our search for a linear
minimax estimator, we may therefore restrict attention to estimators of the form
(1.18).

Application of Corollary 1.6 requires determination of the set $\omega$ of pairs $(p_1, p_2)$
for which the risk of (1.18) takes on its maximum. The risk of (1.18) is

$R_c(p_1, p_2) = E[c(Y - X) - (p_2 - p_1)]^2$

$= c^2 n\,[p_1(1 - p_1) + p_2(1 - p_2)] + (cn - 1)^2 (p_2 - p_1)^2$.

Taking partial derivatives with respect to $p_1$ and $p_2$ and setting the resulting expressions
equal to 0 leads to the two equations

(1.19)    $[2(cn - 1)^2 - 2c^2 n]\,p_1 - 2(cn - 1)^2\,p_2 = -c^2 n$,
          $-2(cn - 1)^2\,p_1 + [2(cn - 1)^2 - 2c^2 n]\,p_2 = -c^2 n$.

Typically, these equations have a unique solution, say $(p_1^0, p_2^0)$, which is the point of
maximum risk. Application of Corollary 1.6 would then have $\Lambda$ assign probability
1 to the point $(p_1^0, p_2^0)$ and the associated Bayes estimator would be $\delta(X, Y) \equiv
p_2^0 - p_1^0$, whose risk does not have a maximum at $(p_1^0, p_2^0)$.

This impasse does not occur if the two equations (1.19) are linearly dependent.
This will be the case only if

$c^2 n = 2(cn - 1)^2$

and hence if

(1.20)    $c = \frac{\sqrt{2n}}{n}\,\frac{1}{\sqrt{2n} \pm 1}$.

Now, a Bayes estimator (4.1.4) does not take on values outside the convex hull
of the range of the estimand, which in the present case is $(-1, 1)$. This rules out
the minus sign in the denominator of $c$. Substituting (1.20) with the plus sign into
(1.19) reduces these two equations to the single equation

(1.21)    $p_1 + p_2 = 1$.

The hoped-for minimax estimator is thus

(1.22)    $\delta(X, Y) = \frac{\sqrt{2n}}{n(\sqrt{2n} + 1)}\,(Y - X)$.

We have shown (and it is easily verified directly, see Problem 1.19) that in the
$(p_1, p_2)$ plane, the risk of this estimator takes on its maximum value at all points
of the line segment (1.21), with $0 \le p_1 \le 1$, which therefore is the conjectured $\omega$
of Corollary 1.6.

It remains to show that (1.22) is the Bayes estimator of a prior distribution $\Lambda$,
which assigns probability 1 to the set (1.21).

Let us now confine attention to this subset and note that $p_1 + p_2 = 1$ implies
$p_2 - p_1 = 2p_2 - 1$. The following lemma reduces the problem of estimating $2p_2 - 1$
to that of estimating $p_2$.
Lemma 1.10 Let $\delta$ be a Bayes (respectively, UMVU, minimax, admissible) estimator
of $g(\theta)$ for squared error loss. Then, $a\delta + b$ is Bayes (respectively, UMVU,
minimax, admissible) for $ag(\theta) + b$.

Proof. This follows immediately from the fact that

$R(ag(\theta) + b, a\delta + b) = a^2 R(g(\theta), \delta)$.

□

For estimating $p_2$, we have, in the present case, $n$ binomial trials with parameter
$p = p_2$ and $n$ binomial trials with parameter $p = p_1 = 1 - p_2$. If we interchange
the meanings of "success" and "failure" in the latter $n$ trials, we have $2n$ binomial
trials with success probability $p_2$, resulting in $Y + (n - X)$ successes. According
to Example 1.7, the estimator

$\frac{Y + n - X}{2n}\,\frac{\sqrt{2n}}{1 + \sqrt{2n}} + \frac{1}{2}\,\frac{1}{1 + \sqrt{2n}}$

is unique Bayes for $p_2$. Applying Lemma 1.10 and collecting terms, we see that
the estimator (1.22) is unique Bayes for estimating $p_2 - p_1 = 2p_2 - 1$ on $\omega$.
It now follows from the properties of this estimator and Corollary 1.5 that $\delta$ is
minimax for estimating $p_2 - p_1$. It is interesting that $\delta(X, Y)$ is not the difference
of the minimax estimators for $p_2$ and $p_1$. This is unlike the behavior of UMVU
estimators.

That $\delta(X, Y)$ is the unique Bayes (and hence minimax) estimator for $p_2 - p_1$,
even when attention is not restricted to $\omega$, follows from the remark after Corollary
4.1.4. It is only necessary to observe that the subsets of the sample space which
have positive probability are the same whether $(p_1, p_2)$ is in $\omega$ or not.

The comparison of the minimax estimator (1.22) with the UMVU estimator
$(Y - X)/n$ gives results similar to those in the case of a single $p$. In particular, the
UMVU estimator is again much better for large $m = n$ (Problem 1.20). ||
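
The claim that the risk of (1.22) is maximized along the whole segment $p_1 + p_2 = 1$ can be verified numerically from the risk formula for $c(Y - X)$ given in this example. The sketch below is an illustration only; the choice `n = 8` (which makes $\sqrt{2n} = 4$) and the grid are assumptions.

```python
import math


def c_minimax(n):
    """Coefficient (1.20) with the plus sign in the denominator."""
    return math.sqrt(2 * n) / (n * (math.sqrt(2 * n) + 1))


def risk(c, n, p1, p2):
    """Risk of c(Y - X) as an estimator of p2 - p1: variance plus squared bias."""
    variance = c**2 * n * (p1 * (1 - p1) + p2 * (1 - p2))
    squared_bias = (c * n - 1) ** 2 * (p2 - p1) ** 2
    return variance + squared_bias


n = 8  # assumed common sample size
c = c_minimax(n)
grid = [i / 200 for i in range(201)]
max_risk = max(risk(c, n, p1, p2) for p1 in grid for p2 in grid)
line_risks = [risk(c, n, p, 1 - p) for p in grid]  # risk along p1 + p2 = 1
```

Along the segment the risk is constant and equal to $(cn - 1)^2 = 1/(1 + \sqrt{2n})^2$, which for `n = 8` is 0.04, and no grid point exceeds it.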

Equation (1.4) implies that a least favorable distribution exists. When such a
distribution does not exist, Theorem 1.4 is not applicable. Consider, for example,
the problem of estimating the mean $\theta$ of a normal distribution with known variance.
Since all possible values of $\theta$ play a completely symmetrical role, in the sense that
none is easier to estimate than any other, it is natural to conjecture that the least
favorable distribution is "uniform" on the real line, that is, that the least favorable
distribution is Lebesgue measure. This is the Jeffreys prior and, in this case, is not
a proper distribution.

There are two ways in which the approach of Theorem 1.4 can be generalized 
to include such improper priors. 

(a) As was seen in Section 4.1, it may turn out that the posterior distribution given
$x$ is a proper distribution. One can then compute the expectation $E[g(\Theta) \mid x]$
for this distribution, a generalized Bayes estimator, and hope that it is the
desired estimator. This approach is discussed, for example, by Sacks (1963),
Brown (1971), and Berger and Srinivasan (1978).

(b) Alternatively, one can approximate the improper prior distribution with a sequence
of proper distributions; for example, Lebesgue measure by the uniform
distributions on $(-N, N)$, $N = 1, 2, \ldots$, and generalize the concept of least
favorable distribution to that of least favorable sequence. We shall here follow
the second approach.

Definition 1.11 A sequence of prior distributions $\{\Lambda_n\}$ is least favorable if for
every prior distribution $\Lambda$ we have

(1.23)    $r_\Lambda \le r = \lim_{n \to \infty} r_{\Lambda_n}$,

where

(1.24)    $r_{\Lambda_n} = \int R(\theta, \delta_n)\,d\Lambda_n(\theta)$

is the Bayes risk under $\Lambda_n$.

Theorem 1.12 Suppose that $\{\Lambda_n\}$ is a sequence of prior distributions with Bayes
risks $r_n$ satisfying (1.23) and that $\delta$ is an estimator for which

(1.25)    $\sup_{\theta} R(\theta, \delta) = r$.

Then

(i) $\delta$ is minimax and

(ii) the sequence $\{\Lambda_n\}$ is least favorable.

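Proof.

(i) Suppose $\delta'$ is any other estimator. Then,

$\sup_{\theta} R(\theta, \delta') \ge \int R(\theta, \delta')\,d\Lambda_n(\theta) \ge r_{\Lambda_n}$,

and this holds for every $n$. Hence,

$\sup_{\theta} R(\theta, \delta') \ge \sup_{\theta} R(\theta, \delta)$,

and $\delta$ is minimax.

(ii) If $\Lambda$ is any distribution, then

$r_\Lambda = \int R(\theta, \delta_\Lambda)\,d\Lambda(\theta) \le \int R(\theta, \delta)\,d\Lambda(\theta) \le \sup_{\theta} R(\theta, \delta) = r$.

This completes the proof. □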

This theorem is less satisfactory than Theorem 1.4 in two respects. First, even if the
Bayes estimators $\delta_n$ are unique, it is not possible to conclude that $\delta$ is the unique
minimax estimator. The reason for this is that the second inequality in the second
line of the proof of (i), which is strict when $\delta_n$ is unique Bayes, becomes weak
under the limit operation.

The other difficulty is that in order to check condition (1.25), it is necessary to
evaluate $r$ and hence the Bayes risk $r_{\Lambda_n}$. This evaluation is often easy when the $\Lambda_n$
are conjugate priors. Alternatively, the following lemma sometimes helps.
Lemma 1.13 If $\delta_\Lambda$ is the Bayes estimator of $g(\theta)$ with respect to $\Lambda$ and if

(1.26)    $r_\Lambda = E[\delta_\Lambda(X) - g(\Theta)]^2$

is its Bayes risk, then

(1.27)    $r_\Lambda = \int \mathrm{var}[g(\Theta) \mid x]\,dP(x)$.

In particular, if the posterior variance of $g(\Theta) \mid x$ is independent of $x$, then

(1.28)    $r_\Lambda = \mathrm{var}[g(\Theta) \mid x]$.

Proof. The right side of (1.26) is equal to

$\int E\{[g(\Theta) - \delta_\Lambda(x)]^2 \mid x\}\,dP(x)$

and the result follows from (4.5.2). □
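
Lemma 1.13 can be checked in the binomial setting of Example 1.7, where the posterior under a beta prior is again beta and the marginal distribution of $X$ is beta-binomial. The sketch below averages the posterior variance over that marginal; the function names and the choice $n = 9$, $a = b = \sqrt{n}/2 = 1.5$ used afterwards are assumptions for illustration.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_pmf(x, n, a, b):
    """Beta-binomial marginal probability of X = x under a beta(a, b) prior."""
    return math.comb(n, x) * math.exp(log_beta(a + x, b + n - x) - log_beta(a, b))


def posterior_variance(x, n, a, b):
    """Variance of the beta(a + x, b + n - x) posterior of p given X = x."""
    s = a + b + n
    return (a + x) * (b + n - x) / (s**2 * (s + 1))


def expected_posterior_variance(n, a, b):
    """Right side of (1.27): posterior variance averaged over the marginal of X."""
    return sum(
        marginal_pmf(x, n, a, b) * posterior_variance(x, n, a, b)
        for x in range(n + 1)
    )
```

For $n = 9$ and $a = b = 1.5$, `expected_posterior_variance(9, 1.5, 1.5)` agrees with the constant risk $1/[4(1 + \sqrt{n})^2] = 1/64$ of the minimax estimator (1.11), as (1.27) requires.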


Example 1.14 Normal mean. Let $X = (X_1, \ldots, X_n)$, with the $X_i$ iid according
to $N(\theta, \sigma^2)$. Let the estimand be $\theta$, the loss squared error, and suppose, at first, that
$\sigma^2$ is known. We shall prove that $\bar{X}$ is minimax by finding a sequence of Bayes
estimators $\delta_n$ satisfying (1.23) with $r = \sigma^2/n$.

As prior distribution for $\theta$, let us try the conjugate normal distribution $N(\mu, b^2)$.
Then, it follows from Example 4.2.2 that the Bayes estimator is

(1.29)    $\delta_\Lambda(x) = \frac{n\bar{x}/\sigma^2 + \mu/b^2}{n/\sigma^2 + 1/b^2}$.

The posterior variance is given by (4.2.3) and is independent of $x$, so that

(1.30)    $r_\Lambda = \frac{1}{n/\sigma^2 + 1/b^2}$.

As $b \to \infty$, $r_\Lambda \uparrow \sigma^2/n$, and this completes the proof of the fact that $\bar{X}$ is minimax.
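
The monotone convergence of the Bayes risks (1.30) to $\sigma^2/n$ is easy to tabulate. The sketch below is only an illustration; the values of `n`, `sigma2`, and the sequence of prior variances are assumptions.

```python
def bayes_risk_normal(n, sigma2, b2):
    """Bayes risk (1.30) under the N(mu, b^2) prior; b2 denotes b squared."""
    return 1.0 / (n / sigma2 + 1.0 / b2)


n, sigma2 = 5, 2.0                        # assumed sample size and variance
prior_variances = (1.0, 10.0, 100.0, 1e4, 1e8)
risks = [bayes_risk_normal(n, sigma2, b2) for b2 in prior_variances]
limit = sigma2 / n                        # risk of the sample mean
```

Each entry of `risks` lies below `limit`, and the entries increase toward it as the prior becomes more diffuse, which is the behavior Theorem 1.12 requires of a least favorable sequence.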

Suppose, now, that $\sigma^2$ is unknown. It follows from the result just proved that the
maximum risk of every estimator will be infinite unless $\sigma^2$ is bounded. We shall
therefore assume that

(1.31)    $\sigma^2 \le M$.

Under this restriction, the maximum risk of $\bar{X}$ is

$\sup_{(\theta, \sigma^2)} E(\bar{X} - \theta)^2 = \frac{M}{n}$.

That $\bar{X}$ is minimax subject to (1.31), then, is an immediate consequence of Lemma
1.15 below.

It is interesting to note that although the boundedness condition (1.31) was 
required for the minimax problem to be meaningful, the minimax estimator does 
not, in fact, depend on the value of M. 

An alternative modification, when $\sigma^2$ is unknown, is to consider the loss function

(1.32)    $L(\theta, \delta) = \frac{1}{\sigma^2}(\theta - \delta)^2$.

For this loss function, the risk of $\bar{X}$ is bounded, and $\bar{X}$ is again minimax (Problem
1.21). ||

We now prove a lemma which is helpful in establishing minimaxity in nonparametric
situations.

Lemma 1.15 Let $X$ be a random quantity with distribution $F$, and let $g(F)$ be a
functional defined over a set $\mathcal{F}_1$ of distributions $F$. Suppose that $\delta$ is a minimax
estimator of $g(F)$ when $F$ is restricted to some subset $\mathcal{F}_0$ of $\mathcal{F}_1$. Then, if

(1.33)    $\sup_{F \in \mathcal{F}_0} R(F, \delta) = \sup_{F \in \mathcal{F}_1} R(F, \delta)$,

$\delta$ is minimax also when $F$ is permitted to vary over $\mathcal{F}_1$.

Proof. If an estimator $\delta'$ existed with smaller sup risk over $\mathcal{F}_1$ than $\delta$, it would also
have smaller sup risk over $\mathcal{F}_0$ and thus contradict the minimax property of $\delta$ over
$\mathcal{F}_0$. □

Example 1.16 Nonparametric mean. Let $X_1, \ldots, X_n$ be iid with distribution $F$
and finite expectation $\theta$, and consider the problem of estimating $\theta$ with squared
error loss. If the maximum risk of every estimator of $\theta$ is infinite, the minimax
problem is meaningless. To rule this out, we shall consider two possible restrictions
on $F$:

(a) Bounded variance,

(1.34)    $\mathrm{var}_F(X_i) \le M < \infty$;

(b) bounded range,

(1.35)    $-\infty < a < X_i < b < \infty$.

Under (a), it is easy to see that $\bar{X}$ is minimax by applying Lemma 1.15 with
$\mathcal{F}_1$ the family of all distributions $F$ satisfying (1.34), and $\mathcal{F}_0$ the family of normal
distributions satisfying (1.34). Then, $\bar{X}$ is minimax for $\mathcal{F}_0$ by Example 1.14. Since
(1.33) holds with $\delta = \bar{X}$, it follows that $\bar{X}$ is minimax for $\mathcal{F}_1$. We shall see in the
next section that it is, in fact, the unique minimax estimator of $\theta$.

To find a minimax estimator of $\theta$ under (b), suppose without loss of generality
that $a = 0$ and $b = 1$, and let $\mathcal{F}_1$ denote the class of distributions $F$ with $F(1) -
F(0) = 1$. It seems plausible in the present case that a least favorable distribution
over $\mathcal{F}_1$ would concentrate on those distributions $F \in \mathcal{F}_1$ which are as spread out
as possible, that is, which put all their mass on the points 0 and 1. But these are
just binomial distributions with $n = 1$. If this conjecture is correct, the minimax
estimator of $\theta$ should reduce to (1.11) when all the $X_i$ are 0 or 1, with $X$ in (1.11)
given by $X = \sum X_i$. This suggests the estimator

(1.36)    $\delta(X_1, \ldots, X_n) = \frac{\sqrt{n}}{1 + \sqrt{n}}\,\bar{X} + \frac{1}{2}\,\frac{1}{1 + \sqrt{n}}$,

and we shall now prove that (1.36) is, indeed, a minimax estimator of $\theta$.
Let $\mathcal{F}_0$ denote the set of distributions $F$ according to which

$P(X_i = 0) = 1 - p$, $P(X_i = 1) = p$, $0 < p < 1$.

Then, it was seen in Example 1.7 that (1.36) is the minimax estimator of $p = E(X_i)$
as $F$ varies over $\mathcal{F}_0$. To prove that (1.36) is minimax with respect to $\mathcal{F}_1$, it is, by
Lemma 1.15, enough to prove that the risk function of the estimator (1.36) takes
on its maximum over $\mathcal{F}_0$.

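Let $R(F, \delta)$ denote the risk of (1.36). Then,

$R(F, \delta) = E\left[\frac{\sqrt{n}}{1 + \sqrt{n}}\,\bar{X} + \frac{1}{2(1 + \sqrt{n})} - \theta\right]^2$.

By adding and subtracting $[\sqrt{n}/(1 + \sqrt{n})]\theta$ inside the square brackets, this is seen
to simplify to

(1.37)    $R(F, \delta) = \frac{1}{(1 + \sqrt{n})^2}\left[\mathrm{var}_F(X) + \left(\frac{1}{2} - \theta\right)^2\right]$.

Now,

$\mathrm{var}_F(X) = E(X - \theta)^2 = E(X^2) - \theta^2 \le E(X) - \theta^2$

since $0 \le X \le 1$ implies $X^2 \le X$. Thus,

(1.38)    $\mathrm{var}_F(X) \le \theta - \theta^2$.

Substitution of (1.38) into (1.37) shows, after some simplification, that

(1.39)    $R(F, \delta) \le \frac{1}{4(1 + \sqrt{n})^2}$.

Since the right side of (1.39) is the (constant) risk of $\delta$ over $\mathcal{F}_0$, the minimax
property of $\delta$ follows. ||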
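
The bound (1.39) can be illustrated by evaluating (1.37) for a few distributions on [0, 1], since the risk depends on $F$ only through its mean and variance. In the sketch below, the particular distributions (two-point, uniform, and beta(2, 5)), the value `n = 16`, and the function names are assumptions chosen for illustration.

```python
import math


def risk_1_37(n, mean, variance):
    """Formula (1.37): risk of (1.36) when a single X_i has this mean and variance."""
    return (variance + (0.5 - mean) ** 2) / (1 + math.sqrt(n)) ** 2


def bound_1_39(n):
    """Right side of (1.39), the constant risk over the two-point distributions."""
    return 1 / (4 * (1 + math.sqrt(n)) ** 2)


n = 16  # assumed sample size
# Two-point distributions on {0, 1}: mean p, variance p(1 - p).
two_point = [risk_1_37(n, p, p * (1 - p)) for p in (0.1, 0.5, 0.9)]
# Uniform on (0, 1): mean 1/2, variance 1/12.
uniform = risk_1_37(n, 0.5, 1 / 12)
# beta(2, 5): mean 2/7, variance 10/392.
beta_2_5 = risk_1_37(n, 2 / 7, 10 / 392)
```

The two-point distributions attain the bound exactly, while the uniform and beta examples fall strictly below it.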


Let us next return to the situation, considered at the beginning of Section 3.7, of
estimating the mean $\bar{a}$ of a population $\{a_1, \ldots, a_N\}$ from a simple random sample
$Y_1, \ldots, Y_n$ drawn from this population. To make the minimax estimation of $\bar{a}$
meaningful, restrictions on the $a$'s are needed. In analogy to (1.34) and (1.35), we
shall consider the following cases:

(a) Bounded population variance,

(1.40)    $\frac{1}{N}\sum (a_i - \bar{a})^2 \le M$;

(b) Bounded range,

(1.41)    $0 \le a_i \le 1$,

to which the more general case $a \le a_i \le b$ can always be reduced. The loss
function will be squared error, and for the time being, we shall ignore the labels. It
will be seen in Section 5.4 that the minimax results remain valid when the labels
are included in the data.

Example 1.17 Simple random sampling. We begin with case (b) and consider
first the special case in which all the values of $a$ are either 1 or 0, say $D$ equal to
1, $N - D$ equal to 0. The total number $X$ of 1's in the sample is then a sufficient
statistic and has the hypergeometric distribution

(1.42)    $P(X = x) = \binom{D}{x}\binom{N - D}{n - x} \Big/ \binom{N}{n}$,

where $\max[0, n - (N - D)] \le x \le \min(n, D)$ (Problem 1.28) and where $D$ can
take on the values $0, 1, \ldots, N$. The estimand is $\bar{a} = D/N$, and, following the
method of Example 4.1.5, one finds that $\alpha X/n + \beta$ with

(1.43)    $\alpha = \frac{1}{1 + \sqrt{\frac{N - n}{n(N - 1)}}}$,  $\beta = \frac{1}{2}(1 - \alpha)$

is a linear estimator with constant risk (Problem 1.29). That (1.43) is minimax is
then a consequence of the fact that it is the Bayes estimator of $D/N$ with respect
to the prior distribution

(1.44)    $P(D = d) = \int_0^1 \binom{N}{d} p^d (1 - p)^{N - d}\,\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,p^{a - 1}(1 - p)^{b - 1}\,dp$,

where

(1.45)    $a = b = \frac{\beta}{\alpha/n - 1/N}$.

It is easily checked that as $N \to \infty$, (1.43) $\to$ (1.11) and (1.45) $\to \frac{1}{2}\sqrt{n}$,
as one would expect since the hypergeometric distribution then tends toward the
binomial.

The special case just treated plays the same role as a tool for the problem of
estimating $\bar{a}$ subject to (1.41) that the binomial case played in Example 1.16. To
show that

(1.46)    $\delta = \alpha\bar{Y} + \beta$

is minimax, it is only necessary to check that

(1.47)    $E(\delta - \bar{a})^2 = \alpha^2\,\mathrm{var}(\bar{Y}) + [\beta + (\alpha - 1)\bar{a}]^2$

takes on its maximum when all the values of $a$ are 0 or 1, and this is seen as in
Example 1.16 (Problem 1.31). Unfortunately, $\delta$ shares the poor risk properties of
the binomial minimax estimator for all but very small $n$.

The minimax estimator of $\bar{a}$ subject to (1.40), as might be expected from Example
1.16, is $\bar{Y}$. For a proof of this result, which will not be given here, see Bickel
and Lehmann (1981) or Hodges and Lehmann (1981). ||
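
The constant-risk property of the linear estimator in the 0-1 population case can be confirmed by summing over the hypergeometric distribution. The sketch below assumes the coefficients $\alpha = 1/[1 + \sqrt{(N - n)/(n(N - 1))}]$ and $\beta = (1 - \alpha)/2$, which are the values that make the risk of $\alpha X/n + \beta$ constant in $D$; the population and sample sizes used afterwards are illustrative assumptions.

```python
import math


def alpha_beta(N, n):
    """Coefficients of the constant-risk linear estimator alpha * X / n + beta."""
    ratio = math.sqrt((N - n) / (n * (N - 1)))
    alpha = 1 / (1 + ratio)
    return alpha, (1 - alpha) / 2


def exact_risk(N, n, D):
    """Exact squared-error risk for estimating D / N, with X hypergeometric."""
    alpha, beta = alpha_beta(N, n)
    total = math.comb(N, n)
    risk = 0.0
    for x in range(max(0, n - (N - D)), min(n, D) + 1):
        prob = math.comb(D, x) * math.comb(N - D, n - x) / total
        risk += prob * (alpha * x / n + beta - D / N) ** 2
    return risk
```

For example, with $N = 20$ and $n = 5$, `exact_risk(20, 5, D)` returns the same value $(1 - \alpha)^2/4$ for every $D = 0, 1, \ldots, 20$, and for very large $N$ the coefficient $\alpha$ approaches $\sqrt{n}/(1 + \sqrt{n})$, the coefficient appearing in (1.11).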


As was seen in Examples 1.7 and 1.8, minimax estimators can be quite unsatisfactory
over a large part of the parameter space. This is perhaps not surprising
since, as a Bayes estimator with respect to a least favorable prior, a minimax estimator
takes the most pessimistic view possible. This is illustrated by Example 1.7,
in which the least favorable prior, $B(a_n, b_n)$ with $a_n = b_n = \sqrt{n}/2$, concentrates
nearly its entire attention on the neighborhood of $p = 1/2$ for which accurate estimation
of $p$ is most difficult. On the other hand, a Bayes estimator corresponding
to a personal prior may expose the investigator to a very high maximum risk, which
may well be realized if the prior has badly misjudged the situation. It is possible
to avoid the worst consequences of both these approaches through a compromise
which permits the use of personal judgment and yet provides adequate protection
against unacceptably high risks.

Suppose that $M$ is the maximum risk of the minimax estimator. Then, one may
be willing to consider estimators whose maximum risk exceeds $M$, if the excess
is controlled, say, if

(1.48)    $R(\theta, \delta) \le M(1 + \varepsilon)$ for all $\theta$,

where $\varepsilon$ is the proportional increase in risk that one is willing to tolerate. A restricted
Bayes estimator is then obtained by minimizing, subject to (1.48), the
average risk (4.1.1) for the prior $\Lambda$ of one's choice.

Such restricted Bayes estimators are typically quite difficult to calculate. There 
is, however, one class of situations in which the evaluation is trivial: If the maximum 
risk of the unrestricted Bayes estimator satisfies (1.48), it, of course, coincides 
with the restricted Bayes estimator. This possibility is illustrated by the following 
example. 

Example 1.18 Binomial restricted Bayes estimator. In Example
4.1.5, suppose we believe $p$ to be near zero (it may, for instance, be the probability
of a rarely occurring disease or accident). As a prior distribution for $p$, we
therefore take $B(1, b)$ with a fairly high value of $b$. The Bayes estimator (4.1.12)
is then $\delta = (X + 1)/(n + b + 1)$ and its risk is

(1.49) $$E(\delta - p)^2 = \frac{np(1 - p) + [(1 - p) - bp]^2}{(n + b + 1)^2}.$$


At $p = 1$, the risk is $[b/(n + b + 1)]^2$, which for fixed $n$ and sufficiently large $b$
can be arbitrarily close to 1, while the constant risk of the minimax estimator is
only $1/[4(1 + \sqrt{n})^2]$. On the other hand, for fixed $b$, an easy calculation shows that
(Problem 1.32)

$$4(1 + \sqrt{n})^2 \sup_p R(p, \delta) \to 1 \quad \text{as } n \to \infty.$$

For any given $b$ and $\varepsilon > 0$, $\delta$ will therefore satisfy (1.48) for sufficiently large
values of $n$. ||
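The comparison in Example 1.18 can be checked numerically. The following is a minimal sketch, not part of the original text: it evaluates the risk (1.49), approximates its supremum over $p$ by a grid search (the grid size is an arbitrary choice), and compares it with the minimax risk. All function names are hypothetical.

```python
import math

def bayes_risk(p, n, b):
    """Risk (1.49) of delta = (X + 1)/(n + b + 1) under squared error loss."""
    return (n * p * (1 - p) + ((1 - p) - b * p) ** 2) / (n + b + 1) ** 2

def minimax_risk(n):
    """Constant risk of the binomial minimax estimator, 1/[4(1 + sqrt(n))^2]."""
    return 1.0 / (4.0 * (1.0 + math.sqrt(n)) ** 2)

def sup_bayes_risk(n, b, grid=20001):
    """Approximate sup over p in [0, 1] of (1.49) by a grid search (an assumption)."""
    return max(bayes_risk(i / (grid - 1), n, b) for i in range(grid))

def risk_ratio(n, b):
    """Ratio of the maximum Bayes-estimator risk to the minimax risk."""
    return sup_bayes_risk(n, b) / minimax_risk(n)
```

For small $n$ and large $b$ the ratio is far above 1, while for fixed $b$ it approaches 1 as $n$ grows, so (1.48) eventually holds for any $\varepsilon > 0$.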


A quite different, and perhaps more typical, situation is illustrated by the normal 
case. 


Example 1.19 Normal. If in the situation of Example 4.2.2, without loss of generality,
we put $\sigma = 1$ and $\mu = 0$, the Bayes estimator (4.2.2) reduces to $c\bar X$ with
$c = nb^2/(1 + nb^2)$. Since its risk function is unbounded for all $n$, while the minimax
risk is $1/n$, no such Bayes estimator can be restricted Bayes.

As a compromise, Efron and Morris (1971) propose an estimator of the form 


(1.50) $$\delta(\bar x) = \begin{cases} \bar x + M & \text{if } \bar x < -M/(1 - c) \\ c\bar x & \text{if } |\bar x| \le M/(1 - c) \\ \bar x - M & \text{if } \bar x > M/(1 - c) \end{cases}$$

for $0 < c < 1$. The risk of these estimators is bounded (Problem 1.33) with
maximum risk tending toward $1/n$ as $M \to 0$. On the other hand, for large $M$
values, (1.50) is close to the Bayes estimator. Although (1.50) is not the exact
optimum solution of the restricted Bayes problem, Efron and Morris (1971) and
Marazzi (1980) show it to be close to optimal. ||
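The piecewise rule (1.50) can be written out directly. This is an illustrative sketch only; the function name and argument order are assumptions, and `x` stands for the observed sample mean.

```python
def limited_translation(x, c, M):
    """Estimator (1.50): shrink like c*x near 0, but never move x by more than M."""
    if not 0 < c < 1:
        raise ValueError("c must lie strictly between 0 and 1")
    cutoff = M / (1 - c)  # beyond this point the shrinkage c*x would exceed M
    if x < -cutoff:
        return x + M
    if x > cutoff:
        return x - M
    return c * x
```

The three pieces agree at $\pm M/(1 - c)$, so the estimator is continuous, and it never differs from the observation by more than $M$. That bounded displacement is what keeps the risk bounded.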

2 Admissibility and Minimaxity in Exponential Families 

It was seen in Example 2.2.6 that a UMVU estimator $\delta$ need not be admissible. If
a biased estimator $\delta'$ has uniformly smaller risk, the choice between $\delta$ and $\delta'$ is not
clear-cut: One must balance the advantage of unbiasedness against the drawback
of larger risk. The situation is, however, different for minimax estimators. If $\delta'$
dominates a minimax estimator $\delta$, then $\delta'$ is also minimax and, thus, definitely
preferred. It is, therefore, particularly important to ascertain whether a proposed
minimax estimator is admissible. In the present section, we shall obtain some
admissibility results (and in the process, some minimax results) for exponential
families, and in the next section, we shall consider the corresponding problem for
group families.

To prove inadmissibility of an estimator $\delta$, it is sufficient to produce an estimator
$\delta'$ which dominates it. An example was given in Lemma 2.2.7. The following is
another instance.

Lemma 2.1 Let the range of the estimand $g(\theta)$ be an interval with end-points $a$
and $b$, and suppose that the loss function $L(\theta, d)$ is positive when $d \ne g(\theta)$ and
zero when $d = g(\theta)$, and that for any fixed $\theta$, $L(\theta, d)$ is increasing as $d$ moves
away from $g(\theta)$ in either direction. Then, any estimator $\delta$ taking on values outside
the closed interval $[a, b]$ with positive probability is inadmissible.

Proof. $\delta$ is dominated by the estimator $\delta'$, which is $a$ or $b$ when $\delta < a$ or $\delta > b$, and
which otherwise is equal to $\delta$. □

Example 2.2 Randomized response. The following is a survey technique sometimes
used when delicate questions are being asked. Suppose, for example, that
the purpose of a survey is to estimate the proportion $p$ of students who have ever
cheated on an exam. Then, the following strategy may be used. With probability $\alpha$
(known), the student is asked the question “Have you ever cheated on an exam?”,
and with probability $(1 - \alpha)$, the question “Have you always been honest on exams?”
The survey taker does not know which question the student answers, so
the answer cannot incriminate the respondent (hence, honesty is encouraged). If a
sample of $n$ students is questioned in this way, the number of positive responses
is a binomial random variable $X^* \sim b(p^*, n)$ with

(2.1) $p^* = \alpha p + (1 - \alpha)(1 - p)$,

where $p$ is the probability of cheating, and

(2.2) $\min\{\alpha, 1 - \alpha\} \le p^* \le \max\{\alpha, 1 - \alpha\}$.

For estimating the probability $p = [p^* - (1 - \alpha)]/(2\alpha - 1)$, the method of moments
estimator $\tilde p = [\hat p^* - (1 - \alpha)]/(2\alpha - 1)$, with $\hat p^* = X^*/n$, is inadmissible by Lemma 2.1. The MLE
of $p$, which is equal to $\tilde p$ if $\hat p^*$ falls in the interval specified in (2.2) and takes on
the endpoint values if $\hat p^*$ is not in the interval, is also inadmissible, although this
fact does not follow directly from Lemma 2.1. (Inadmissibility of the MLE of $p$
follows from Moors (1981); see also Hoeffding 1982 and Chaudhuri and Mukerjee
1988.) ||
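To see why Lemma 2.1 applies, the following sketch solves (2.1) for $p$ and shows that the method of moments estimator can leave $[0, 1]$. The function names are hypothetical, and `truncated_estimator` is simply the estimator $\delta'$ from the proof of Lemma 2.1 applied to the range $[0, 1]$ of $p$.

```python
def moment_estimator(x_star, n, alpha):
    """Method of moments estimator of p obtained by inverting (2.1)."""
    if alpha == 0.5:
        raise ValueError("p is not identifiable when alpha = 1/2")
    p_star_hat = x_star / n  # observed proportion of positive responses
    return (p_star_hat - (1 - alpha)) / (2 * alpha - 1)

def truncated_estimator(x_star, n, alpha):
    """Moment estimator pulled back to the nearest endpoint of [0, 1]."""
    return min(1.0, max(0.0, moment_estimator(x_star, n, alpha)))
```

With $\alpha = 0.7$ and $n = 10$, no positive responses give a moment estimate below 0 and ten positive responses give one above 1. The truncated version replaces these by 0 and 1 and is never farther from $p$.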

Example 2.3 Variance components. Another application of Lemma 2.1 occurs
in the estimation of variance components. In the one-way layout with random
effects (see Example 3.5.1 or 4.2.7), let

(2.3) $X_{ij} = \mu + A_i + U_{ij}, \quad j = 1, \ldots, n_i, \quad i = 1, \ldots, s$,

where the variables $A_i \sim N(0, \sigma_A^2)$ and $U_{ij} \sim N(0, \sigma^2)$ are independent. The
parameter $\sigma_A^2$ has range $[0, \infty)$; hence, any estimator $\delta$ taking on negative values
is an inadmissible estimator of $\sigma_A^2$ (against any loss function for which the risk
function exists). The UMVU estimator of $\sigma_A^2$ [see (3.5.4)] has this property and
hence is inadmissible. ||


A principal method for proving admissibility is the following result. 

Theorem 2.4 Any unique¹ Bayes estimator is admissible.

Proof. If $\delta$ is unique Bayes with respect to the prior distribution $\Lambda$ and is dominated
by $\delta'$, then

$$\int R(\theta, \delta')\, d\Lambda(\theta) \le \int R(\theta, \delta)\, d\Lambda(\theta),$$

which contradicts uniqueness. □


An example is provided by the binomial minimax estimator (1.11) of Example
1.7. For the corresponding nonparametric minimax estimator (1.36) of Example
1.16, admissibility was proved by Hjort (1976) who showed that it is the essentially
unique minimax estimator with respect to a class of Dirichlet-process priors
described by Ferguson (1973).

We shall, in the present section, illustrate a number of ideas and results concerning
admissibility on the estimation of the mean and variance of a normal distribution
and then indicate some of their generalizations. Unless stated otherwise, the loss
function will be assumed to be squared error.


Example 2.5 Admissibility of linear estimators. Let $X_1, \ldots, X_n$ be independent,
each distributed according to a $N(\theta, \sigma^2)$, with $\sigma^2$ known. In the preceding
section, $\bar X$ was seen to be minimax for estimating $\theta$. Is it admissible? Instead of
attacking this question directly, we shall consider the admissibility of an arbitrary
linear function $a\bar X + b$.

From Example 4.2.2, it follows that the unique Bayes estimator with respect to
the normal prior for $\theta$ with mean $\mu$ and variance $\tau^2$ is

(2.4) $$\frac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar X + \frac{\sigma^2}{\sigma^2 + n\tau^2}\,\mu$$

and that the associated Bayes risk is finite (Problem 2.2). It follows that $a\bar X + b$ is
unique Bayes and hence admissible whenever

(2.5) $0 < a < 1$.

¹ Uniqueness here means that any two Bayes estimators differ only on a set $N$ with $P_\theta(N) = 0$ for
all $\theta$.

To see what can be said about other values of $a$, we shall now prove an inadmissibility
result for linear estimators, which is quite general and in particular does
not require the assumption of normality.

Theorem 2.6 Let $X$ be a random variable with mean $\theta$ and variance $\sigma^2$. Then,
$aX + b$ is an inadmissible estimator of $\theta$ under squared error loss whenever

(i) $a > 1$, or

(ii) $a < 0$, or

(iii) $a = 1$ and $b \ne 0$.

Proof. The risk of $aX + b$ is

(2.6) $\rho(a, b) = E(aX + b - \theta)^2 = a^2\sigma^2 + [(a - 1)\theta + b]^2$.

(i) If $a > 1$, then

$$\rho(a, b) \ge a^2\sigma^2 > \sigma^2 = \rho(1, 0)$$

so that $aX + b$ is dominated by $X$.

(ii) If $a < 0$, then $(a - 1)^2 > 1$ and hence

$$\rho(a, b) \ge [(a - 1)\theta + b]^2 = (a - 1)^2\left[\theta + \frac{b}{a - 1}\right]^2 \ge \left[\theta + \frac{b}{a - 1}\right]^2,$$

with strict inequality in the last step unless $\theta = -b/(a - 1)$.
Thus, $aX + b$ is dominated by the constant estimator $\delta \equiv -b/(a - 1)$.

(iii) In this case, $aX + b = X + b$ is dominated by $X$ (see Lemma 2.2.7). □
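The three cases of Theorem 2.6 can be checked numerically from the risk formula (2.6). The sketch below is illustrative only; the function names and the default $\sigma^2 = 1$ are assumptions.

```python
def linear_risk(a, b, theta, sigma2=1.0):
    """Risk (2.6) of aX + b: variance term plus squared bias."""
    return a * a * sigma2 + ((a - 1) * theta + b) ** 2

def constant_risk(c, theta):
    """Risk of the constant estimator delta = c, which has zero variance."""
    return (c - theta) ** 2
```

For case (ii), the dominating constant is $-b/(a - 1)$. Comparing `linear_risk(a, b, theta)` with `constant_risk(-b / (a - 1), theta)` over a grid of $\theta$ values reproduces the chain of inequalities in the proof.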

Example 2.7 Continuation of Example 2.5. Combining the results of Example
2.5 and Theorem 2.6, we see that the estimator $a\bar X + b$ is admissible in the strip
$0 < a < 1$ in the $(a, b)$ plane, and that it is inadmissible to the left ($a < 0$) and to the
right ($a > 1$).

The left boundary $a = 0$ corresponds to the constant estimators $\delta = b$, which are
admissible since $\delta = b$ is the only estimator with zero risk at $\theta = b$. Finally, the
right boundary $a = 1$ is inadmissible by (iii) of Theorem 2.6, with the possible
exception of the point $a = 1$, $b = 0$. ||

We have thus settled the admissibility of $a\bar X + b$ for all cases except $\bar X$ itself,
which was the estimator of primary interest. In the next example, we shall prove
that $\bar X$ is indeed admissible.

Example 2.8 Admissibility of $\bar X$. The admissibility of $\bar X$ for estimating the mean
of a normal distribution is not only of great interest in itself but can also be regarded
as the starting point of many other admissibility investigations. For this reason, we
shall now give two proofs of this fact; they represent two principal methods for
proving admissibility and are seen particularly clearly in this example because of
its great simplicity.






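First Proof of Admissibility (the Limiting Bayes Method). Suppose that $\bar X$ is not
admissible, and without loss of generality, assume that $\sigma = 1$. Then, there exists
$\delta^*$ such that

$$R(\theta, \delta^*) \le \frac{1}{n} \quad \text{for all } \theta,$$

$$R(\theta, \delta^*) < \frac{1}{n} \quad \text{for at least some } \theta.$$

Now, $R(\theta, \delta)$ is a continuous function of $\theta$ for every $\delta$ so that there exist $\varepsilon > 0$
and $\theta_0 < \theta_1$ such that

$$R(\theta, \delta^*) < \frac{1}{n} - \varepsilon \quad \text{for all } \theta_0 < \theta < \theta_1.$$

Let $r_\tau^*$ be the average risk of $\delta^*$ with respect to the prior distribution $\Lambda_\tau = N(0, \tau^2)$,
and let $r_\tau$ be the Bayes risk, that is, the average risk of the Bayes solution with
respect to $\Lambda_\tau$. Then, by (1.30) with $\sigma = 1$ and $\tau$ in place of $b$,

$$\frac{\frac{1}{n} - r_\tau^*}{\frac{1}{n} - r_\tau} = \frac{\frac{1}{\sqrt{2\pi}\,\tau}\int_{-\infty}^{\infty}\left[\frac{1}{n} - R(\theta, \delta^*)\right] e^{-\theta^2/2\tau^2}\, d\theta}{\frac{1}{n} - \frac{\tau^2}{1 + n\tau^2}} \ge \frac{n(1 + n\tau^2)\,\varepsilon}{\tau\sqrt{2\pi}} \int_{\theta_0}^{\theta_1} e^{-\theta^2/2\tau^2}\, d\theta.$$

The integrand converges monotonically to 1 as $\tau \to \infty$. By the Lebesgue monotone
convergence theorem (TSH2, Theorem 2.2.1), the integral therefore converges
to $\theta_1 - \theta_0$, and, hence, as $\tau \to \infty$,

$$\frac{\frac{1}{n} - r_\tau^*}{\frac{1}{n} - r_\tau} \to \infty.$$

Thus, there exists $\tau_0$ such that $r_{\tau_0}^* < r_{\tau_0}$, which contradicts the fact that $r_{\tau_0}$ is the
Bayes risk for $\Lambda_{\tau_0}$. This completes the proof.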
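The divergence that drives the limiting Bayes argument is easy to see numerically. The lower bound on the ratio $(1/n - r_\tau^*)/(1/n - r_\tau)$ is $n(1 + n\tau^2)\varepsilon/(\tau\sqrt{2\pi})$ times the integral of $e^{-\theta^2/2\tau^2}$ over $(\theta_0, \theta_1)$. That integral equals $\tau\sqrt{2\pi}$ times the $N(0, \tau^2)$ probability of the interval, so the bound simplifies to $n(1 + n\tau^2)\varepsilon$ times that probability. The sketch below evaluates this; the function name and the particular values of $\varepsilon$, $\theta_0$, $\theta_1$ are hypothetical.

```python
import math

def ratio_lower_bound(tau, n, eps, theta0, theta1):
    """Lower bound on (1/n - r*_tau)/(1/n - r_tau) from the first proof.

    Uses the identity: integral of exp(-t^2 / (2 tau^2)) over (theta0, theta1)
    equals tau * sqrt(2 pi) * P(theta0 < N(0, tau^2) < theta1).
    """
    scale = tau * math.sqrt(2.0)
    prior_mass = 0.5 * (math.erf(theta1 / scale) - math.erf(theta0 / scale))
    return n * (1.0 + n * tau ** 2) * eps * prior_mass
```

For example, with $n = 1$, $\varepsilon = 0.01$ and $(\theta_0, \theta_1) = (0, 1)$, the bound grows roughly linearly in $\tau$ and eventually exceeds 1, which is the contradiction used in the proof.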

A more general version of this approach, known as Blyth’s method , will be given 
in Theorem 7.13. 


Second Proof of Admissibility (the Information Inequality Method). Another useful
tool for establishing admissibility is based on the information inequality and
solutions to a differential inequality, a method due to Hodges and Lehmann (1951).
It follows from the information inequality (2.5.33) and the fact that

$$R(\theta, \delta) = E(\delta - \theta)^2 = \mathrm{var}_\theta(\delta) + b^2(\theta),$$

where $b(\theta)$ is the bias of $\delta$, that

(2.7) $$R(\theta, \delta) \ge \frac{[1 + b'(\theta)]^2}{nI(\theta)} + b^2(\theta),$$

where the first term on the right is the information inequality variance bound for
estimators with expected value $\theta + b(\theta)$. Note that, in the present case with $\sigma^2 = 1$,
$I(\theta) = 1$ from Table 2.5.1.

Suppose, now, that $\delta$ is any estimator satisfying

(2.8) $$R(\theta, \delta) \le \frac{1}{n} \quad \text{for all } \theta$$

and hence

(2.9) $$\frac{[1 + b'(\theta)]^2}{n} + b^2(\theta) \le R(\theta, \delta) \le \frac{1}{n}.$$

We shall then show that (2.9) implies

(2.10) $b(\theta) \equiv 0$ for all $\theta$,

that is, that $\delta$ is unbiased.

(i) Since $|b(\theta)| \le 1/\sqrt{n}$, the function $b$ is bounded.

(ii) From the fact that

$$1 + 2b'(\theta) + [b'(\theta)]^2 \le 1,$$

it follows that $b'(\theta) \le 0$, so that $b$ is nonincreasing.

(iii) We shall show, next, that there exists a sequence of values $\theta_i$ tending to $\infty$
and such that $b'(\theta_i) \to 0$. Suppose that $b'(\theta)$ were bounded away from 0 as
$\theta \to \infty$, say $b'(\theta) \le -\varepsilon$ for all $\theta > \theta_0$. Then $b(\theta)$ cannot be bounded as
$\theta \to \infty$, which contradicts (i).

(iv) Analogously, it is seen that there exists a sequence of values $\theta_i \to -\infty$ and
such that $b'(\theta_i) \to 0$ (Problem 2.3).

Inequality (2.9) together with (iii) and (iv) shows that $b(\theta) \to 0$ as $\theta \to \pm\infty$,
and (2.10) now follows from (ii).

Since (2.10) implies that $b(\theta) = b'(\theta) = 0$ for all $\theta$, it implies by (2.7) that

$$R(\theta, \delta) \ge \frac{1}{n} \quad \text{for all } \theta$$

and hence that

$$R(\theta, \delta) \equiv \frac{1}{n}.$$

This proves that $\bar X$ is admissible and minimax. That it is, in fact, the only minimax
estimator is an immediate consequence of Theorem 1.7.10.

For another application of this second method of proof, see Problem 2.7. ||


Admissibility (hence, minimaxity) of $\bar X$ holds not only for squared error loss
but for large classes of loss functions $L(\theta, d) = \rho(d - \theta)$. In particular, it holds if
$\rho(t)$ is nondecreasing as $t$ moves away from 0 in either direction and satisfies the
growth condition

$$\int |t|\,\rho(2|t|)\,\phi(t)\, dt < \infty,$$

with the only exceptions being the loss functions

$$\rho(0) = a, \quad \rho(t) = b \text{ for } |t| \ne 0, \quad a < b.$$

This result² follows from Brown (1966, Theorem 2.1.1); it is also proved under
somewhat stronger conditions in Hajek (1972).


² Communicated by L. Brown.

Example 2.9 Truncated normal mean. In Example 2.8, suppose it is known that
$\theta > \theta_0$. Then, it follows from Lemma 2.1 that $\bar X$ is no longer admissible. However,
assuming that $\sigma^2 = 1$ and using the method of the second proof of Example 2.8, it
is easy to show that $\bar X$ continues to be minimax. If it were not, there would exist
an estimator $\delta$ and an $\varepsilon > 0$ such that

$$R(\theta, \delta) \le \frac{1}{n} - \varepsilon \quad \text{for all } \theta > \theta_0$$

and hence

$$\frac{[1 + b'(\theta)]^2}{n} + b^2(\theta) \le \frac{1}{n} - \varepsilon \quad \text{for all } \theta > \theta_0.$$

As a consequence, $b(\theta)$ would be bounded and satisfy $b'(\theta) \le -\varepsilon n/2$ for all
$\theta > \theta_0$, and these two statements are contradictory.

This example provides an instance in which the minimax estimator is not unique
and the constant risk estimator $\bar X$ is inadmissible. A uniformly better estimator
which a fortiori is also minimax is $\max(\theta_0, \bar X)$, but it, too, is inadmissible [see
Sacks (1963), in which a characterization of all admissible estimators is given].
Admissible minimax estimators in this case were found by Katz (1961) and Sacks
(1963); see also Gupta and Rohatgi 1980.

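If $\theta$ is further restricted to satisfy $a \le \theta \le b$, $\bar X$ is not only inadmissible but also
no longer minimax. If $\bar X$ were minimax, the same would be true of its improvement,
the MLE

$$\delta^* = \begin{cases} a & \text{if } \bar X < a \\ \bar X & \text{if } a \le \bar X \le b \\ b & \text{if } \bar X > b, \end{cases}$$

so that

$$\sup_{a \le \theta \le b} R(\theta, \delta^*) = \sup_{a \le \theta \le b} R(\theta, \bar X) = \frac{1}{n}.$$

However, $R(\theta, \delta^*) < R(\theta, \bar X) = 1/n$ for all $a \le \theta \le b$. Furthermore, $R(\theta, \delta^*)$
is a continuous function of $\theta$ and hence takes on its maximum at some point
$a \le \theta_0 \le b$. Thus,

$$\sup_{a \le \theta \le b} R(\theta, \delta^*) = R(\theta_0, \delta^*) < \frac{1}{n},$$

which provides a contradiction.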
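The strict inequality $R(\theta, \delta^*) < 1/n$ on $[a, b]$ can be illustrated by computing the risk of the truncated MLE numerically. This is a rough sketch under stated assumptions: $\sigma^2 = 1$, a simple trapezoidal rule over $\pm 10$ standard deviations, and hypothetical function names.

```python
import math

def clip(x, a, b):
    """The MLE delta*: the sample mean pulled back into [a, b]."""
    return min(b, max(a, x))

def mle_risk(theta, a, b, n=1, half_width=10.0, steps=20000):
    """Squared-error risk of delta* when the sample mean is N(theta, 1/n).

    Approximated by the trapezoidal rule (the integration scheme is an assumption).
    """
    sd = 1.0 / math.sqrt(n)
    lo = theta - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        density = math.exp(-0.5 * ((x - theta) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (clip(x, a, b) - theta) ** 2 * density
    return total * h
```

With $n = 1$ and $[a, b] = [-1, 1]$, the computed risk stays well below $1/n = 1$ at every $\theta$ in the interval, including the endpoints.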

It follows from Wald’s general decision theory (see Section 5.8) that in the
present situation, there exists a probability distribution $\Lambda$ over $[a, b]$ which satisfies
(1.4) and (1.6). We shall now prove that the associated set $\omega_\Lambda$ of (1.5) consists of
a finite number of points. Suppose the contrary were true. Then, $\omega_\Lambda$ contains an
infinite sequence of points with a limit point. Since $R(\theta, \delta_\Lambda)$ is constant over these
points and since it is an analytic function of $\theta$, it follows that $R(\theta, \delta_\Lambda)$ is constant,
not only in $[a, b]$ but for all $\theta$. Example 2.8 then shows that $\delta_\Lambda = \bar X$, which is in
contradiction to the fact that $\bar X$ is not minimax for the present problem.

To simplify matters, and without losing generality [Problem 2.9(a)], we can take
$a = -m$ and $b = m$ and, thus, consider $\theta$ to be restricted to the interval $[-m, m]$. To
determine a minimax estimator, let us consider the form of a least favorable prior.
Since $\Lambda$ is concentrated on a finite number of points, it is reasonable to suspect
that these points would be placed at distances neither too close together nor too
far apart, where “close” is relative to the standard deviation of the density of $\bar X$.
(If the points are either much closer together or much further apart, then the prior
might be giving us information.) One might therefore conjecture that the number
of points in $\omega_\Lambda$ increases with $m$, and for small $m$, look at the Bayes estimator for
the two-point prior $\Lambda$ that puts mass $1/2$ at $\pm m$.

Figure 2.1. Risk functions of bounded mean estimators for m = 1.056742, n = 1.

The Bayes estimator, against squared error loss, is [Problem 2.9(b)]

(2.11) $\delta^\Lambda(\bar x) = m \tanh(mn\bar x)$

where $\tanh(\cdot)$ is the hyperbolic tangent function. For $m \le 1.05/\sqrt{n}$, Corollary
1.6 can be used to show that $\delta^\Lambda$ is minimax and provides a substantial risk decrease
over $\bar X$. Moreover, for $m < 1/\sqrt{n}$, $\delta^\Lambda$ also dominates the MLE $\delta^*$ [Problem 2.9(c)].
This is illustrated in Figure 2.1, where we have taken $m$ to be the largest value for
which (2.11) is minimax. Note that the risk of $\delta^\Lambda$ is equal at $\theta = 0$ and $\theta = m$.
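Formula (2.11) can be checked against a direct posterior-mean computation for the two-point prior, and the risk at $\theta = 0$ and $\theta = m$ can be compared numerically. The sketch below assumes $\sigma^2 = 1$ and uses a trapezoidal rule for the risk; all function names are hypothetical. The direct posterior computation underflows for very large $|\bar x|$, so it is meant only for moderate arguments.

```python
import math

def bayes_two_point(xbar, m, n=1):
    """Posterior mean of theta under prior mass 1/2 at +m and -m, Xbar ~ N(theta, 1/n)."""
    w_plus = math.exp(-0.5 * n * (xbar - m) ** 2)   # likelihood at theta = +m
    w_minus = math.exp(-0.5 * n * (xbar + m) ** 2)  # likelihood at theta = -m
    return m * (w_plus - w_minus) / (w_plus + w_minus)

def tanh_estimator(xbar, m, n=1):
    """Closed form (2.11)."""
    return m * math.tanh(m * n * xbar)

def risk(theta, m, n=1, half_width=10.0, steps=20000):
    """Squared-error risk of (2.11) by the trapezoidal rule (an assumed scheme)."""
    sd = 1.0 / math.sqrt(n)
    lo = theta - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        density = math.exp(-0.5 * ((x - theta) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (tanh_estimator(x, m, n) - theta) ** 2 * density
    return total * h
```

Evaluating `risk(0.0, 1.056742)` and `risk(1.056742, 1.056742)` gives nearly equal values, both well below the risk $1/n = 1$ of $\bar X$, in line with the remark about Figure 2.1.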

As $m$ increases, so does the number of points in $\omega_\Lambda$. The range of values of $m$,
for which the associated Bayes estimator is minimax, was established by Casella
and Strawderman (1981) for 2- and 3-point priors and Kempthorne (1988a, 1988b)
for 4-point priors. Some interesting results concerning $\Lambda$ and $\delta^\Lambda$, for large $m$, are
given by Bickel (1981). An alternative estimator, the Bayes estimator against a
uniform prior on $[-m, m]$, was studied by Gatsonis et al. (1987) and shown to
perform reasonably when compared to $\delta^\Lambda$ and to dominate $\delta^*$ for $|\theta| \le m/\sqrt{n}$.
Many of these results were discovered independently by Zinzius (1981, 1982),
who derived minimax estimators for $\theta$ restricted to the interval $[0, c]$, where $c$ is
known and small. ||

Example 2.10 Linear minimax risk. Suppose that in Example 2.9 we decide to
restrict attention to linear estimators $a\bar X + b$ because of their simplicity.
With $\sigma^2 = 1$, from the proof of Theorem 2.6 [see also Problem 4.3.12(a)],

$$R(\theta, a\bar X + b) = a^2\,\mathrm{var}\,\bar X + [(a - 1)\theta + b]^2 = a^2/n + [(a - 1)\theta + b]^2,$$

and from Theorem 2.6, we only need consider $0 \le a \le 1$. It is straightforward to
establish (Problem 2.10) that

$$\max_{\theta \in [-m, m]} R(\theta, a\bar X + b) = \max\{R(-m, a\bar X + b),\ R(m, a\bar X + b)\}$$

and that $\delta^* = a^*\bar X$, with $a^* = m^2/(\frac{1}{n} + m^2)$, is minimax among linear estimators.

Donoho et al. (1990) provide bounds on the ratio of the linear minimax risk to the
minimax risk. They show that, surprisingly, this ratio is approximately 1.25 and,
hence, that the linear minimax estimators may sometimes be reasonable substitutes
for the full minimax estimators. ||
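The value of $a^*$ can be confirmed by a brute-force search over $a$. The sketch below is illustrative; the function names are hypothetical, and the search is restricted to $b = 0$, which is the optimal intercept by symmetry of $[-m, m]$.

```python
def max_linear_risk(a, m, n, b=0.0):
    """Maximum over theta in [-m, m] of a^2/n + [(a - 1) theta + b]^2.

    The risk is a convex quadratic in theta, so the maximum is at an endpoint.
    """
    return max(a * a / n + ((a - 1) * t + b) ** 2 for t in (-m, m))

def a_star(m, n):
    """Linear minimax coefficient a* = m^2 / (1/n + m^2)."""
    return m * m / (1.0 / n + m * m)
```

Substituting $a^*$ shows that the linear minimax risk is $m^2/(1 + nm^2)$. For $m = 1$, $n = 4$ this gives $a^* = 0.8$ and a maximum risk of $0.2$, compared with $1/n = 0.25$ for $\bar X$.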

Example 2.11 Linear model. Consider the general linear model of Section 3.4
and suppose we wish to estimate some linear function of the $\xi$’s. Without loss
of generality, we can assume that the model is expressed in the canonical form
(4.8) so that $Y_1, \ldots, Y_n$ are independent, normal, with common variance $\sigma^2$, and
$E(Y_i) = \eta_i$ $(i = 1, \ldots, s)$; $E(Y_{s+1}) = \cdots = E(Y_n) = 0$. The estimand can be taken
to be $\eta_1$. If $Y_2, \ldots, Y_n$ were not present, it would follow from Example 2.8 that
$Y_1$ is admissible for estimating $\eta_1$. It is obvious from the Rao-Blackwell theorem
(Theorem 1.7.8) that the presence of $Y_{s+1}, \ldots, Y_n$ cannot affect this result. The
following lemma shows that, as one would expect, the same is true for $Y_2, \ldots, Y_s$.

Lemma 2.12 Let $X$ and $Y$ be independent (possibly vector-valued) with distributions
$F_\xi$ and $G_\eta$, respectively, where $\xi$ and $\eta$ vary independently. Then, if $\delta(X)$
is admissible for estimating $\xi$ when $Y$ is not present, it continues to be so in the
presence of $Y$.

Proof. Suppose, to the contrary, that there exists an estimator $T(X, Y)$ satisfying

$$R(\xi, \eta; T) \le R(\xi; \delta) \quad \text{for all } \xi, \eta,$$

$$R(\xi_0, \eta_0; T) < R(\xi_0; \delta) \quad \text{for some } \xi_0, \eta_0.$$

Consider the case in which it is known that $\eta = \eta_0$. Then, $\delta(X)$ is admissible on
the basis of $X$ and $Y$ (Problem 2.11). On the other hand,

$$R(\xi, \eta_0; T) \le R(\xi; \delta) \quad \text{for all } \xi,$$

$$R(\xi_0, \eta_0; T) < R(\xi_0; \delta) \quad \text{for some } \xi_0,$$

and this is a contradiction. □


The examples so far have been concerned with normal means. Let us now turn 
to the estimation of a normal variance. 


Example 2.13 Normal variance. Under the assumptions of Example 4.2.5, let
us consider the admissibility, using squared error loss, of linear estimators $aY + b$
of $1/\tau = 2\sigma^2$. The Bayes solutions

(2.12) $$\frac{\alpha + Y}{r + g - 1},$$

derived there for the prior distributions $\Gamma(g, 1/\alpha)$, appear to prove admissibility
of $aY + b$ with

(2.13) $$0 < a < \frac{1}{r - 1}, \quad 0 < b.$$

In particular, this includes the estimators $(1/r)Y + b$ for any $b > 0$. On the other
hand, it follows from (2.7) that $E(Y) = r/\tau$, so that $(1/r)Y$ is an unbiased estimator
of $1/\tau$, and hence from Lemma 2.2.7, that $(1/r)Y + b$ is inadmissible for any $b > 0$.
What went wrong?

Conditions (i) and (ii) of Corollary 4.1.4 indicate two ways in which the uniqueness
(hence, admissibility) of a Bayes estimator may be violated. The second of
these clearly does not apply here since the gamma prior assigns positive density
to all values $\tau > 0$. This leaves the first possibility as the only visible suspect. Let
us, therefore, consider the Bayes risk of the estimator (2.12).

Given $\tau$, we find [by adding and subtracting the expectation of $Y/(g + r - 1)$],
that

$$E\left(\frac{Y + \alpha}{g + r - 1} - \frac{1}{\tau}\right)^2 = \frac{1}{(g + r - 1)^2}\left[\frac{r}{\tau^2} + \left(\alpha - \frac{g - 1}{\tau}\right)^2\right].$$

The Bayes risk will therefore be finite if and only if $E(1/\tau^2) < \infty$, where the
expectation is taken with respect to the prior and, hence, if and only if $g > 2$.
Applying this condition to (2.12), we see that admissibility has not been proved
for the region (2.13), as seemed the case originally, but only for the smaller region

(2.14) $$0 < a < \frac{1}{r + 1}, \quad 0 < b.$$

In fact, it is not difficult to prove inadmissibility for all $a > 1/(r + 1)$ (Problem
2.12), whereas for $a < 0$ and for $b < 0$, it, of course, follows from Lemma 2.1.

The left boundary $a = 0$ of the strip (2.14) is admissible as it was in Example
2.5; the bottom boundary $b = 0$ was seen to be inadmissible for any positive
$a \ne 1/(r + 1)$ in Example 2.2.6. This leaves in doubt only the point $a = b = 0$,
which is inadmissible (Problem 2.13), and the right boundary, corresponding to
the estimators

(2.15) $$\frac{1}{r + 1}\,Y + b, \quad 0 \le b < \infty.$$

Admissibility of (2.15) for $b = 0$ was first proved by Karlin (1958), who considered
the case of general one-parameter exponential families. His proof was extended to
other values of $b$ by Ping (1964) and Gupta (1966). We shall follow Ping’s proof,
which uses the second method of Example 2.8, whereas Karlin (1958) and Stone
(1967) employed the first method. ||
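The special role of $a = 1/(r + 1)$ is visible directly from the risk of $aY + b$. The sketch below assumes, beyond the stated $E(Y) = r/\tau$, that $\mathrm{var}(Y) = r/\tau^2$ (the variance of a gamma variable with shape $r$ and rate $\tau$); the function name is hypothetical.

```python
def risk_linear(a, b, tau, r):
    """E[(aY + b - 1/tau)^2], assuming E(Y) = r/tau and var(Y) = r/tau^2."""
    variance = a * a * r / tau ** 2
    bias = a * r / tau + b - 1.0 / tau
    return variance + bias ** 2
```

For $b = 0$ the risk equals $[a^2 r + (ar - 1)^2]/\tau^2$, which is minimized at $a = 1/(r + 1)$ for every $\tau$. Adding any $b > 0$ to the unbiased estimator $(1/r)Y$ increases its risk by $b^2$.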


Let $X$ have probability density

(2.16) $p_\theta(x) = \beta(\theta)e^{\theta T(x)}$ ($\theta$, $T$ real-valued)

with respect to $\mu$ and let $\Omega$ be the natural parameter space. Then, $\Omega$ is an interval,
with endpoints, say, $\underline{\theta}$ and $\bar{\theta}$ $(-\infty \le \underline{\theta} \le \bar{\theta} \le \infty)$ (see Section 1.5). For estimating
$E_\theta(T)$, the estimator $aT + b$ is inadmissible if $a < 0$ or $a > 1$ and is a constant for
$a = 0$. To state Karlin’s sufficient condition in the remaining cases, it is convenient
to write the estimator as

(2.17) $$\delta_{\lambda,\gamma}(x) = \frac{1}{1 + \lambda}\,T + \frac{\gamma\lambda}{1 + \lambda}$$

with $0 \le \lambda < \infty$ corresponding to $0 < a \le 1$.


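Theorem 2.14 (Karlin’s Theorem) Under the above assumptions, a sufficient
condition for the admissibility of the estimator (2.17) for estimating $g(\theta) = E_\theta(T)$
with squared error loss is that the integral of $e^{-\gamma\lambda\theta}[\beta(\theta)]^{-\lambda}$ diverges at $\underline{\theta}$ and $\bar{\theta}$;
that is, that for some (and hence for all) $\underline{\theta} < \theta_0 < \bar{\theta}$, the two integrals

(2.18) $$\int_{\theta_0}^{\theta^*} \frac{e^{-\gamma\lambda\theta}}{[\beta(\theta)]^\lambda}\, d\theta \quad \text{and} \quad \int_{\theta^*}^{\theta_0} \frac{e^{-\gamma\lambda\theta}}{[\beta(\theta)]^\lambda}\, d\theta$$

tend to infinity as $\theta^*$ tends to $\bar{\theta}$ and $\underline{\theta}$, respectively.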


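Proof. It is seen from (1.5.14) and (1.5.15) that

(2.19) $$g(\theta) = E_\theta(T) = \frac{-\beta'(\theta)}{\beta(\theta)}$$

and

(2.20) $g'(\theta) = \mathrm{var}_\theta(T) = I(\theta)$,

where $I(\theta)$ is the Fisher information defined in (2.5.10). For any estimator $\delta(X)$,
we have

(2.21) $$\begin{aligned} E_\theta[\delta(X) - g(\theta)]^2 &= \mathrm{var}_\theta[\delta(X)] + b^2(\theta) \\ &\ge \frac{[g'(\theta) + b'(\theta)]^2}{I(\theta)} + b^2(\theta) \quad \text{[information inequality]} \\ &= \frac{[I(\theta) + b'(\theta)]^2}{I(\theta)} + b^2(\theta), \end{aligned}$$

where $b(\theta) = E_\theta[\delta(X)] - g(\theta)$ is the bias of $\delta(X)$. If $\delta = \delta_{\lambda,\gamma}$ of (2.17), then its
bias is $b_{\lambda,\gamma}(\theta) = \frac{\lambda}{1 + \lambda}[\gamma - g(\theta)]$ with $b'_{\lambda,\gamma}(\theta) = -\frac{\lambda}{1 + \lambda}\,g'(\theta) = -\frac{\lambda}{1 + \lambda}\,I(\theta)$, and

(2.22) $$\begin{aligned} E_\theta[\delta_{\lambda,\gamma}(X) - g(\theta)]^2 &= E_\theta\left[\frac{T + \gamma\lambda}{1 + \lambda} - g(\theta)\right]^2 \\ &= \frac{I(\theta)}{(1 + \lambda)^2} + \frac{\lambda^2[g(\theta) - \gamma]^2}{(1 + \lambda)^2} \\ &= b^2_{\lambda,\gamma}(\theta) + \frac{[I(\theta) + b'_{\lambda,\gamma}(\theta)]^2}{I(\theta)}. \end{aligned}$$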

Thus, for estimators (2.17), the information inequality risk bound is an equality. 
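Now, suppose that $\delta_0$ satisfies

(2.23) $E_\theta[\delta_{\lambda,\gamma}(X) - g(\theta)]^2 \ge E_\theta[\delta_0(X) - g(\theta)]^2$ for all $\theta$.

Denote the bias of $\delta_0$ by $b_0(\theta)$, apply inequality (2.21) to the right side of (2.23),
and apply Equation (2.22) to the left side of (2.23) to get

(2.24) $$b^2_{\lambda,\gamma}(\theta) + \frac{[I(\theta) + b'_{\lambda,\gamma}(\theta)]^2}{I(\theta)} \ge b_0^2(\theta) + \frac{[I(\theta) + b_0'(\theta)]^2}{I(\theta)}.$$

If $h(\theta) = b_0(\theta) - b_{\lambda,\gamma}(\theta)$, (2.24) reduces to

(2.25) $$h^2(\theta) - \frac{2\lambda}{1 + \lambda}\,h(\theta)[g(\theta) - \gamma] + \frac{2}{1 + \lambda}\,h'(\theta) + \frac{[h'(\theta)]^2}{I(\theta)} \le 0,$$

which implies

(2.26) $$h^2(\theta) - \frac{2\lambda}{1 + \lambda}\,h(\theta)[g(\theta) - \gamma] + \frac{2}{1 + \lambda}\,h'(\theta) \le 0.$$

Finally, let

$$\kappa(\theta) = h(\theta)\,\beta^\lambda(\theta)\,e^{\gamma\lambda\theta}.$$

Differentiation of $\kappa(\theta)$ and use of (2.19) reduces (2.26) to (Problem 2.7)

(2.27) $$\kappa^2(\theta)\,\beta^{-\lambda}(\theta)\,e^{-\gamma\lambda\theta} + \frac{2}{1 + \lambda}\,\kappa'(\theta) \le 0.$$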

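We shall now show that (2.27) with $\lambda \ge 0$ implies that $\kappa(\theta) \ge 0$ for all $\theta$.
Suppose to the contrary that $\kappa(\theta_0) < 0$ for some $\theta_0$. Then, $\kappa(\theta) < 0$ for all $\theta \ge \theta_0$
since $\kappa'(\theta) \le 0$, and for $\theta \ge \theta_0$, we can write (2.27) as

$$\frac{d}{d\theta}\left[\frac{1}{\kappa(\theta)}\right] \ge \frac{1 + \lambda}{2}\,\beta^{-\lambda}(\theta)\,e^{-\gamma\lambda\theta}.$$

Integrating both sides from $\theta_0$ to $\theta^*$ leads to

$$\frac{1}{\kappa(\theta^*)} - \frac{1}{\kappa(\theta_0)} \ge \frac{1 + \lambda}{2}\int_{\theta_0}^{\theta^*} \beta^{-\lambda}(\theta)\,e^{-\gamma\lambda\theta}\, d\theta.$$

As $\theta^* \to \bar{\theta}$, the right side tends to infinity, and this provides a contradiction since
the left side is $< -1/\kappa(\theta_0)$.

Similarly, $\kappa(\theta) \le 0$ for all $\theta$. It follows that $\kappa(\theta)$ and, hence, $h(\theta)$ is zero for all
$\theta$. This shows that for all $\theta$ equality holds in (2.25), (2.24), and, thus, (2.23). This
proves the admissibility of (2.17). □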


Under some additional restrictions, it is shown by Diaconis and Ylvisaker (1979) 
that when the sufficient condition of Theorem 2.14 holds, aX + b is Bayes with 
respect to a proper prior distribution (a member of the conjugate family) and has 
finite risk. This, of course, implies that it is admissible.

Karlin (1958) conjectured that the sufficient condition of Theorem 2.14 is also
necessary for the admissibility of (2.17). Despite further work on this problem
(Morton and Raghavachari 1966, Stone 1967, Joshi 1969a), this conjecture has
not yet been settled. See Brown (1986a) for further discussion.

Let us now see whether Theorem 2.14 settles the admissibility of T/(r + 1), 
which was left open in Example 2.13. 

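Example 2.15 Continuation of Example 2.13. The density of Example 4.2.5 is
of the form (2.16) with

$$\theta = -r\tau, \quad \beta(\theta) = \left(\frac{-\theta}{r}\right)^r, \quad \frac{Y}{r} = T(X), \quad \underline{\theta} = -\infty, \quad \bar{\theta} = 0.$$

Here, the parameterization is chosen so that

$$E_\theta[T(X)] = \frac{1}{\tau}$$

coincides with the estimand of Example 2.13. An estimator

(2.28) $$\frac{1}{1 + \lambda}\,\frac{Y}{r} + \frac{\gamma\lambda}{1 + \lambda}$$

is therefore admissible, provided the integrals

$$\int_{-\infty}^{-c} e^{-\gamma\lambda\theta}\left(\frac{-\theta}{r}\right)^{-r\lambda} d\theta = C\int_c^\infty e^{\gamma\lambda\theta}\,\theta^{-r\lambda}\, d\theta$$

and

$$\int_0^c e^{\gamma\lambda\theta}\,\theta^{-r\lambda}\, d\theta$$

are both infinite.

The conditions for the first integral to be infinite are that either

$$\gamma = 0 \text{ and } r\lambda \le 1, \quad \text{or} \quad \gamma\lambda > 0.$$

For the second integral, the factor $e^{\gamma\lambda\theta}$ plays no role, and the condition is simply

$$r\lambda \ge 1.$$

Combining these conditions, we see that the estimator (2.28) is admissible if either

(a) $\gamma = 0$ and $\lambda = \frac{1}{r}$

or

(b) $\lambda \ge \frac{1}{r}$ and $\gamma > 0$ (since $r > 0$).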

If we put $a = 1/[(1 + \lambda)r]$ and $b = \gamma\lambda/(1 + \lambda)$, it follows that $aY + b$ is admissible
if either

(a’) $b = 0$ and $a = \frac{1}{1 + r}$

or

(b’) $b > 0$ and $0 < a \le \frac{1}{1 + r}$.

The first of these results settles the one case that was left in doubt in Example
2.13; the second confirms the admissibility of the interior of the strip (2.14), which
had already been established in that example. The admissibility of $Y/(r + 1)$ for
estimating $1/\tau = 2\sigma^2$ means that

(2.29) $$\frac{1}{2r + 2}\,Y = \frac{1}{n + 2}\,Y$$

is admissible for estimating $\sigma^2$. The estimator (2.29) is the MRE estimator for $\sigma^2$
found in Section 3.3 (Example 3.3.7). ||


Example 2.16 Normal variance, unknown mean. Admissibility of the estimator
$\sum X_i^2/(n + 2)$ when the $X$’s are from $N(0, \sigma^2)$ naturally raises the corresponding
question for

(2.30) $\sum (X_i - \bar X)^2/(n + 1)$,

the MRE estimator of $\sigma^2$ when the $X$’s are from $N(\xi, \sigma^2)$ with $\xi$ unknown (Example
3.3.11). The surprising answer, due to Stein (1964), is that (2.30) is not
admissible (see Examples 3.3.13 and 5.2.15). An estimator with uniformly smaller
risk is

(2.31) $$\delta^S = \min\left(\frac{\sum (X_i - \bar X)^2}{n + 1},\ \frac{\sum X_i^2}{n + 2}\right).$$

The estimator (2.30) is MRE under the location-scale group, that is, among
estimators that satisfy

(2.32) $\delta(ax_1 + b, \ldots, ax_n + b) = a^2\delta(x_1, \ldots, x_n)$.

To search for a better estimator of $\sigma^2$ than (2.30), consider the larger class of
estimators that are only scale invariant. These are the estimators of $\sigma^2$ that satisfy
(2.32) with $b = 0$, and are of the form

(2.33) $\delta(\bar x, s) = \varphi(\bar x/s)\,s^2$.

The estimator $\delta^S$ is of this form.

As a motivation of $\delta^S$, suppose that it is thought a priori likely but by no means
certain that $\xi = 0$. One might then wish to test the hypothesis $H: \xi = 0$ by the
usual $t$-test. If

(2.34) $$\frac{|\sqrt{n}\,\bar X|}{\sqrt{\sum (X_i - \bar X)^2/(n - 1)}} < c,$$

one would accept $H$ and correspondingly estimate $\sigma^2$ by $\sum X_i^2/(n + 2)$; in the
contrary case, $H$ would be rejected and $\sigma^2$ estimated by (2.30). For the value
$c = \sqrt{(n - 1)/(n + 1)}$, it is easily checked that (2.34) is equivalent to

(2.35) $$\frac{1}{n + 2}\sum X_i^2 < \frac{1}{n + 1}\sum (X_i - \bar X)^2,$$

and the resulting estimator then reduces to (2.31).
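The equivalence of (2.34) and (2.35) for $c = \sqrt{(n - 1)/(n + 1)}$ follows from $\sum X_i^2 = \sum (X_i - \bar X)^2 + n\bar X^2$, and it can be spot-checked on data. The sketch below is illustrative, with hypothetical function names; it assumes the sample is not constant, so that the $t$ statistic is defined.

```python
import math

def sums_of_squares(xs):
    """Return n, the sample mean, sum (x - xbar)^2 and sum x^2."""
    n = len(xs)
    xbar = sum(xs) / n
    centered = sum((x - xbar) ** 2 for x in xs)
    raw = sum(x * x for x in xs)
    return n, xbar, centered, raw

def mre_estimator(xs):
    """Estimator (2.30)."""
    n, _, centered, _ = sums_of_squares(xs)
    return centered / (n + 1)

def stein_estimator(xs):
    """Estimator (2.31): the smaller of (2.30) and sum x^2 / (n + 2)."""
    n, _, centered, raw = sums_of_squares(xs)
    return min(centered / (n + 1), raw / (n + 2))

def t_test_accepts(xs):
    """Condition (2.34) with c = sqrt((n - 1)/(n + 1))."""
    n, xbar, centered, _ = sums_of_squares(xs)
    c = math.sqrt((n - 1) / (n + 1))
    return abs(math.sqrt(n) * xbar) / math.sqrt(centered / (n - 1)) < c
```

Whenever `t_test_accepts` is true, `stein_estimator` returns $\sum x_i^2/(n + 2)$; otherwise it coincides with `mre_estimator`. In either case it never exceeds (2.30).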

While (2.30) is inadmissible, it is clear that no substantial improvement is possible,
since $\sum (X_i - \bar X)^2/\sigma^2$ has the same distribution as $\sum (X_i - \xi)^2/\sigma^2$ with $n$
replaced by $n - 1$ so that ignorance of $\sigma^2$ can be compensated for by one additional
observation. Rukhin (1987) shows that the maximum relative risk improvement is
approximately 4%. ||

Let us now return to Theorem 2.14 and apply it to the binomial case as another 
illustration. 

Example 2.17 Binomial. Let $X$ have the binomial distribution $b(p, n)$, which we
shall write as

(2.36) $$P(X = x) = \binom{n}{x}(1 - p)^n\, e^{(x/n)\,n\log(p/(1 - p))}.$$

Putting $\theta = n\log(p/(1 - p))$, we have

$$\beta(\theta) = (1 - p)^n = [1 + e^{\theta/n}]^{-n}$$

and

$$g(\theta) = E_\theta\left(\frac{X}{n}\right) = p = \frac{e^{\theta/n}}{1 + e^{\theta/n}}.$$

Furthermore, as $p$ ranges from 0 to 1, $\theta$ ranges from $\underline{\theta} = -\infty$ to $\bar{\theta} = +\infty$.

The integral in question is then

(2.37) $$\int e^{-\gamma\lambda\theta}(1 + e^{\theta/n})^{\lambda n}\, d\theta$$

and the estimator $X/[n(1 + \lambda)] + \gamma\lambda/(1 + \lambda)$ is admissible, provided this integral
diverges at both $-\infty$ and $+\infty$. If $\lambda < 0$, the integrand is $\le e^{-\gamma\lambda\theta}$ and the integral
cannot diverge at both limits, whereas for $\lambda = 0$, the integral does diverge at both
limits. Suppose, therefore, that $\lambda > 0$. Near infinity, the dominating term (which
is also a lower bound) is

$$\int e^{-\gamma\lambda\theta + \lambda\theta}\, d\theta,$$

which diverges provided $\gamma \le 1$. At the other end, we have

$$\int_{-\infty}^{-c} e^{-\gamma\lambda\theta}(1 + e^{\theta/n})^{\lambda n}\, d\theta = \int_c^\infty e^{\gamma\lambda\theta}\left(1 + \frac{1}{e^{\theta/n}}\right)^{\lambda n} d\theta.$$

The factor in parentheses does not affect the convergence or divergence of this
integral, which therefore diverges if and only if $\gamma\lambda \ge 0$. The integral will therefore
diverge at both limits, provided

(2.38) $\lambda > 0$ and $0 \le \gamma \le 1$, or $\lambda = 0$.

With $a = 1/(1 + \lambda)$ and $b = \gamma\lambda/(1 + \lambda)$, this condition is seen to be equivalent
(Problem 2.7) to

(2.39) $0 < a \le 1$, $0 \le b$, $a + b \le 1$.

The estimator, of course, is also admissible when $a = 0$ and $0 \le b \le 1$, and it is
easy to see that it is inadmissible for the remaining values of $a$ and $b$ (Problem
2.8). The region of admissibility is, therefore, the closed triangle $\{(a, b) : a \ge 0,
b \ge 0, a + b \le 1\}$. ||
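The equivalence between (2.38) and (2.39) is a matter of algebra, since $a + b = (1 + \gamma\lambda)/(1 + \lambda)$. The sketch below checks it with exact rational arithmetic to avoid rounding at the boundary $a + b = 1$. The function names are hypothetical, and inputs are assumed to be integers or `Fraction` values with $\lambda \ge 0$.

```python
from fractions import Fraction

def to_ab(lam, gamma):
    """Map (lambda, gamma) to the coefficients of a * (X/n) + b."""
    lam, gamma = Fraction(lam), Fraction(gamma)
    return 1 / (1 + lam), gamma * lam / (1 + lam)

def satisfies_2_38(lam, gamma):
    """Condition (2.38) on (lambda, gamma)."""
    return (lam > 0 and 0 <= gamma <= 1) or lam == 0

def satisfies_2_39(a, b):
    """Condition (2.39) on (a, b)."""
    return 0 < a <= 1 and b >= 0 and a + b <= 1
```

For instance, $\lambda = 1$, $\gamma = 1$ maps to $a = b = 1/2$, which lies on the edge $a + b = 1$ of the triangle, while $\gamma > 1$ or $\gamma < 0$ (with $\lambda > 0$) falls outside it.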

Theorem 2.14 provides a simple condition for the admissibility of $T$ as an
estimator of $E_\theta(T)$.

Corollary 2.18 If the natural parameter space of (2.16) is the whole real line so
that $\underline{\theta} = -\infty$, $\bar{\theta} = \infty$, then $T$ is admissible for estimating $E_\theta(T)$ with squared
error loss.

Proof. With $\lambda = 0$ and $\gamma = 1$, the two integrals (2.18) clearly tend toward infinity
as $\theta \to \pm\infty$. □

The condition of this corollary is satisfied by the normal (variance known), 
binomial, and Poisson distributions, but not in the gamma or negative binomial 
case (Problem 2.25). 

The starting point of this section was the question of admissibility of some 
minimax estimators. In the opposite direction, it is sometimes possible to use the 
admissibility of an estimator to prove that it is minimax. 

Lemma 2.19 If an estimator has constant risk and is admissible, it is minimax. 

Proof. If it were not, another estimator would have smaller maximum risk and, 
hence, uniformly smaller risk. □ 

This lemma together with Corollary 2.18 yields the following minimax result. 

Corollary 2.20 Under the assumptions of Corollary 2.18, $T$ is the unique minimax 
estimator of $g(\theta) = E_\theta(T)$ for the loss function $[d - g(\theta)]^2/\mathrm{var}_\theta(T)$. 

Proof. For this loss function, $T$ is a constant risk estimator which is admissible 
by Corollary 2.18 and unique by Theorem 1.7.10. □ 

A companion to Lemma 2.19 allows us to deduce admissibility from unique 
minimaxity. 

Lemma 2.21 If an estimator is unique minimax, it is admissible. 

Proof. If it were not admissible, another estimator would dominate it in risk and, 
hence, would be minimax. □ 

Example 2.22 Binomial admissible minimax estimator. If $X$ has the binomial 
distribution $b(p, n)$, then, by Corollary 2.20, $X/n$ is the unique minimax estimator 
of $p$ for the loss function $(d - p)^2/pq$ (which was seen in Example 1.7). By 
Lemma 2.21, $X/n$ is admissible for this loss function. ||
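The constant-risk property used here is easy to confirm by direct summation over the binomial distribution. The sketch below is illustrative only; the function name `risk_standardized` is an assumption of this example, and $q = 1 - p$.

```python
from math import comb


def risk_standardized(n, p):
    """Exact risk of X/n under the loss (d - p)^2 / (p*q), q = 1 - p,
    when X has the binomial distribution b(p, n)."""
    q = 1.0 - p
    expected_sq_error = sum(
        comb(n, x) * p**x * q ** (n - x) * (x / n - p) ** 2
        for x in range(n + 1)
    )
    return expected_sq_error / (p * q)


# The risk does not depend on p: it equals var(X/n)/(p*q) = 1/n.
risks = [risk_standardized(10, p) for p in (0.1, 0.5, 0.9)]
```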


The estimation of a normal variance with unknown mean provided a surprising 
example of a reasonable estimator which is inadmissible. We shall conclude this 
section with an example of a totally unreasonable estimator that is admissible. 


Example 2.23 Two binomials. Let $X$ and $Y$ be independent binomial random 
variables with distributions $b(p, m)$ and $b(\pi, n)$, respectively. It was shown by 
Makani (1977) that a necessary and sufficient condition for 

(2.40)    $a\frac{X}{m} + b\frac{Y}{n} + c$

to be admissible for estimating $p$ with squared error loss is that either 

(2.41)    $0 \le a < 1$, $0 \le c \le 1$, $0 \le a + c \le 1$, 
          $0 \le b + c \le 1$, $0 \le a + b + c \le 1$, 

or 

(2.42)    $a = 1$ and $b = c = 0$. 

We shall now prove the sufficiency part, which is the result of interest; for 
necessity, see Problem 2.21. 

Suppose there exists another estimator $\delta(X, Y)$ with risk uniformly at least as 
small as that of (2.40), so that 

    $E\left(a\frac{X}{m} + b\frac{Y}{n} + c - p\right)^2 \ge E[\delta(X, Y) - p]^2$ for all $p$. 

Then 

(2.43)    $\sum_{x=0}^{m} \sum_{k=0}^{n} \left(a\frac{x}{m} + b\frac{k}{n} + c - p\right)^2 P(X = x, Y = k) \ge \sum_{x=0}^{m} \sum_{k=0}^{n} [\delta(x, k) - p]^2 P(X = x, Y = k).$

Letting $\pi \to 0$, this leads to 

    $\sum_{x=0}^{m} \left(a\frac{x}{m} + c - p\right)^2 P(X = x) \ge \sum_{x=0}^{m} [\delta(x, 0) - p]^2 P(X = x)$

for all $p$. However, $a(X/m) + c$ is admissible by Example 2.17; hence $\delta(x, 0) = 
a(x/m) + c$ for all $x = 0, 1, \ldots, m$. 

The terms in (2.43) with $k = 0$, therefore, cancel. The remaining terms contain a 
common factor $\pi$ which can also be canceled and one can now proceed as before. 
Continuing in this way by induction over $k$, one finds at the $(k + 1)$st stage that 

    $\sum_{x=0}^{m} \left(a\frac{x}{m} + b\frac{k}{n} + c - p\right)^2 P(X = x) \ge \sum_{x=0}^{m} [\delta(x, k) - p]^2 P(X = x)$

for all $p$. However, $aX/m + bk/n + c$ is admissible by Example 2.17 since 

    $a + b\frac{k}{n} + c \le 1$

and, hence, 

    $\delta(x, k) = a\frac{x}{m} + b\frac{k}{n} + c$ for all $x$. 

This shows that (2.43) implies 

    $\delta(x, y) = a\frac{x}{m} + b\frac{y}{n} + c$ for all $x$ and $y$ 

and, hence, that (2.40) is admissible. 

Putting $a = 0$ in (2.40), we see that estimates of the form $b(Y/n) + c$ ($0 \le 
c \le 1$, $0 \le b + c \le 1$) are admissible for estimating $p$ despite the fact that 
only the distribution of $X$ depends on $p$ and that $X$ and $Y$ are independent. This 
paradoxical result suggests that admissibility is an extremely weak property. While 
it is somewhat embarrassing for an estimator to be inadmissible, the fact that it 
is admissible in no way guarantees that it is a good or even halfway reasonable 
estimator. ||

The result of Example 2.23 is not isolated. An exactly analogous result holds in 
the Poisson case (Problem 2.22) and a very similar one due to Brown for normal 
distributions (see Example 7.2); that an exactly analogous example is not possible 
in the normal case follows from Cohen (1965a). 

3 Admissibility and Minimaxity in Group Families 

The two preceding sections dealt with minimax estimators and their admissibility 
in exponential families. Let us now consider the corresponding problems for group 
families. As was seen in Section 3.2, in these families there typically exists an MRE 
estimator $\delta_0$ for any invariant loss function, and it is a constant risk estimator. If 
$\delta_0$ is also a Bayes estimator, it is minimax by Corollary 1.5 and admissible if it is 
unique Bayes. 

Recall Theorem 4.4.1, where it was shown that a Bayes estimator under an 
invariant prior is (almost) equivariant. It follows that under the assumptions of 
that theorem, there exists an almost equivariant estimator which is admissible. 
Furthermore, it turns out that under very weak additional assumptions, given any 
almost equivariant estimator $\delta$, there exists an equivariant estimator $\delta'$ which differs 
from $\delta$ only on a fixed null set $N$. The existence of such a $\delta'$ is obvious in the 
simplest case, that of a finite group. We shall not prove it here for more general 
groups (a precise statement and proof can be found in TSH2, Section 6.5, Theorem 
4). Since $\delta$ and $\delta'$ then have the same risk function, this establishes the existence 
of an equivariant estimator that is admissible. 

Theorem 4.4.1 does not require $\bar{G}$ to be transitive over $\Omega$. If we add the assumption 
of transitivity, we get a stronger result. 

Theorem 3.1 Under the conditions of Theorem 4.4.1, if $\bar{G}$ is transitive over $\Omega$, 
then the MRE estimator is admissible and minimax. 

The crucial assumption in this approach is the existence of an invariant prior 
distribution. The following example illustrates the rather trivial case in which the 
group is finite. 

Example 3.2 Finite group. Let $X_1, \ldots, X_n$ be iid according to the normal distribution 
$N(\xi, 1)$. Then, the problem of estimating $\xi$ with squared error loss remains 
invariant under the two-element group $G$, which consists of the identity transformation 
$e$ and the transformation 

    $g(x_1, \ldots, x_n) = (-x_1, \ldots, -x_n)$;    $\bar{g}\xi = -\xi$;    $g^* d = -d$. 

In the present case, any distribution $\Lambda$ for $\xi$ which is symmetric with respect to 
the origin is invariant. Under the conditions of Theorem 4.4.1, it follows that for 
any such $\Lambda$, there is a version of the Bayes solution which is equivariant, that is, 
which satisfies $\delta(-x_1, \ldots, -x_n) = -\delta(x_1, \ldots, x_n)$. The group $\bar{G}$ in this case is, of 
course, not transitive over $\Omega$. ||

As an example in which G is not finite, we shall consider the following version 
of the location problem on the circle. 
Example 3.3 Circular location family. Let $U_1, \ldots, U_n$ be iid on $(0, 2\pi)$ according 
to a distribution $F$ with density $f$. We shall interpret these variables as $n$ 
points chosen at random on the unit circle according to $F$. Suppose that each point 
is translated on the circle by an amount $\theta$ ($0 \le \theta < 2\pi$) (i.e., the new positions 
are those obtained by rotating the circle by an amount $\theta$). When a value $U_i + \theta$ 
exceeds $2\pi$, it is, of course, replaced by $U_i + \theta - 2\pi$. The resulting values are the 
observations $X_1, \ldots, X_n$. It is then easily seen (Problem 3.2) that the density of 
$X_i$ is 

(3.1)    $f(x_i - \theta + 2\pi)$ when $0 < x_i < \theta$, 
         $f(x_i - \theta)$ when $\theta < x_i < 2\pi$. 

This can also be written as 

(3.2)    $f(x_i - \theta) I(\theta < x_i) + f(x_i - \theta + 2\pi) I(x_i < \theta)$

where $I(a < b)$ is 1 when $a < b$, and 0 otherwise. 

If we straighten the circle to a straight-line segment of length $2\pi$, we can also 
represent this family of distributions in the following form. Select $n$ points at 
random on $(0, 2\pi)$ according to $F$. Cut the line segment at an arbitrary point $\theta$ 
($0 < \theta < 2\pi$). Place the upper segment so that its endpoints are $(0, 2\pi - \theta)$ and the 
lower segment so that its endpoints are $(2\pi - \theta, 2\pi)$, and denote the coordinates 
of the $n$ points in their new positions by $X_1, \ldots, X_n$. Then, the density of $X_i$ is 
given by (3.1). 

As an illustration of how such a family of distributions might arise, suppose 
that in a study of gestation in rats, $n$ rats are impregnated by artificial insemination 
at a given time, say at midnight on day zero. The observations are the $n$ times 
$Y_1, \ldots, Y_n$ to birth, recorded as the number of days plus a fractional day. It is 
assumed that the $Y$'s are iid according to $G(y - \eta)$ where $G$ is known and $\eta$ is 
an unknown location parameter. A scientist who is interested in the time of day at 
which births occur abstracts from the data the fractional parts $X_i' = Y_i - [Y_i]$. The 
variables $X_i = 2\pi X_i'$ have a distribution of the form (3.1) where $\theta$ is $2\pi$ times the 
fractional part of $\eta$. 

Let us now return to (3.2) and consider the problem of estimating $\theta$. The model as 
originally formulated remains invariant under rotations of the circle. To represent 
these transformations formally, consider for any real number $a$ the unique number 
$a^*$, $0 \le a^* < 2\pi$, for which $a = 2k\pi + a^*$ ($k$ an integer). Then, the group $G$ of 
rotations can be represented by 

    $x_i' = (x_i + c)^*$,    $\theta' = (\theta + c)^*$,    $d' = (d + c)^*$. 

A loss function $L(\theta, d)$ remains invariant under $G$ if and only if it is of the form 
$L(\theta, d) = \rho[(d - \theta)^*]$ (Problem 3.3). Typically, one would want it to depend only 
on $(d - \theta)^{**} = \min\{(d - \theta)^*, (2\pi - (d - \theta))^*\}$, which is the difference between $d$ 
and $\theta$ along the smaller of the two arcs connecting them. Thus, the loss might be 
$((d - \theta)^{**})^2$ or $|(d - \theta)^{**}|$. It is important to notice that neither of these is convex 
(Problem 3.4). 
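The reduction $a^*$ and the arc distance $(d - \theta)^{**}$ translate directly into code. The following sketch is illustrative; the names `star` and `arc_distance` are hypothetical, and the last lines give one numerical instance of the failure of convexity of the squared arc distance on $(0, 2\pi)$.

```python
import math

TWO_PI = 2.0 * math.pi


def star(a):
    """Reduce a real number a to a* in [0, 2*pi) with a = 2*k*pi + a*.
    (Floating-point rounding can return a value equal to 2*pi for tiny
    negative inputs; this does not affect arc_distance below.)"""
    return a % TWO_PI


def arc_distance(d, theta):
    """(d - theta)**: length of the shorter of the two arcs joining d and theta."""
    return min(star(d - theta), star(theta - d))


def squared_arc_loss(theta, d):
    return arc_distance(d, theta) ** 2


# Non-convexity in d at theta = 0: the loss at the midpoint of d1 and d2
# exceeds the average of the losses at d1 and d2.
d1, d2 = math.pi - 1.0, math.pi + 1.0
mid_loss = squared_arc_loss(0.0, 0.5 * (d1 + d2))
avg_loss = 0.5 * (squared_arc_loss(0.0, d1) + squared_arc_loss(0.0, d2))
```

Rotating both arguments by the same amount $c$, that is, replacing $d$ and $\theta$ by `star(d + c)` and `star(theta + c)`, leaves `arc_distance` unchanged up to rounding, which is the invariance described above.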

The group $\bar{G}$ is transitive over $\Omega$ and an invariant distribution for $\theta$ is the uniform 
distribution over $(0, 2\pi)$. By applying an obvious extension of the construction 
(3.20) or (3.21) below, one obtains an admissible equivariant (and, hence, constant 
risk) Bayes estimator, which a fortiori is also minimax. If the loss function is not 
convex in $d$, only the extension of (3.20) is available and the equivariant Bayes 
procedure may be randomized. ||


Let us next turn to the question of the admissibility and minimaxity of MRE 
estimators which are Bayes solutions with respect to improper priors. We begin 
with the location parameter case. 

Example 3.4 Location family on the line. Suppose that $X = (X_1, \ldots, X_n)$ has 
density 

(3.3)    $f(x - \theta) = f(x_1 - \theta, \ldots, x_n - \theta)$, 

and let $G$ and $\bar{G}$ be the groups of translations $x_i' = x_i + a$ and $\theta' = \theta + a$. The 
parameter space $\Omega$ is the real line, and from Example 4.4.3, the invariant measure 
on $\Omega$ is the measure $\nu$ which to any interval $I$ assigns its length, that is, Lebesgue 
measure. 

Since the measure $\nu$ is improper, we proceed as in Section 4.3 and look for a 
generalized Bayes estimator. The posterior density of $\theta$ given $x$ is given by 

(3.4)    $\frac{f(x - \theta)}{\int f(x - \theta)\, d\theta}.$

This quantity is non-negative and its integral, with respect to $\theta$, is equal to 1. It 
therefore defines a proper distribution for $\theta$, and by Section 4.3, the generalized 
Bayes estimator of $\theta$, with loss function $L$, is obtained by minimizing the posterior 
expected loss 

(3.5)    $\int L[\theta, \delta(x)]\, f(x - \theta)\, d\theta \Big/ \int f(x - \theta)\, d\theta.$

For the case that $L$ is squared error, the minimizing value of $\delta(x)$ is the expectation 
of $\theta$ under (3.4), which was seen to be the Pitman estimator (3.1.28) in Example 4.4.7. 
The agreement of the estimator minimizing (3.5) with that obtained in 
Section 3.1 of course holds also for all other invariant loss functions. 

Up to this point, the development here is completely analogous to that of Example 3.3. 
However, since $\nu$ is not a probability distribution, Theorem 4.4.1 is not 
applicable and we cannot conclude that the Pitman estimator is admissible or even 
minimax. ||
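For squared error loss, the generalized Bayes estimator is the mean of the posterior density (3.4), and it can be approximated by numerical integration. The sketch below assumes independent observations with a common density (so that the joint density is a product) and uses a simple midpoint rule on a finite window; the function names, the integration window, and the two example densities are choices made for this illustration only.

```python
import math


def pitman(xs, f, lo=-50.0, hi=50.0, steps=20000):
    """Approximate the posterior mean of theta under Lebesgue measure,
    int theta * prod f(x_i - theta) dtheta / int prod f(x_i - theta) dtheta,
    by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    num = den = 0.0
    for j in range(steps):
        t = lo + (j + 0.5) * h
        w = 1.0
        for x in xs:
            w *= f(x - t)
        num += t * w
        den += w
    return num / den


def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def exponential_pdf(u):
    return math.exp(-u) if u > 0 else 0.0


xs = [1.3, 2.0, 0.7]
est_normal = pitman(xs, normal_pdf)        # close to the sample mean
est_exp = pitman(xs, exponential_pdf)      # close to min(xs) - 1/n
est_shifted = pitman([x + 2.0 for x in xs], normal_pdf)
```

In the normal case the result agrees with the sample mean, in the exponential case with $x_{(1)} - 1/n$, and shifting the data by a constant shifts the estimate by the same constant, as equivariance requires.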


The minimax character of the Pitman estimator was established in the normal 
case in Example 1.14 by the use of a least favorable sequence of prior distributions. 
We shall now consider the minimax and admissibility properties of MRE estimators 
more generally in group families, beginning with the case of a general location 
family. 

Theorem 3.5 Suppose $X = (X_1, \ldots, X_n)$ is distributed according to the density 
(3.3) and that the Pitman estimator $\delta^*$ given by (1.28) has finite variance. Then, 
$\delta^*$ is minimax for squared error loss. 

Proof. As in Example 1.14, we shall utilize Theorem 1.12, and for this purpose, 
we require a least favorable sequence of prior distributions. In view of the discussion 
at the beginning of Example 3.4, one would expect a sequence of priors that 
approximates Lebesgue measure to be suitable. The sequence of normal distributions 
with variance tending toward infinity used in Example 1.14 was of this kind. 
Here, it will be more convenient to use instead a sequence of uniform densities 

(3.6)    $\pi_T(u) = 1/(2T)$ if $|u| < T$, and $\pi_T(u) = 0$ otherwise, 

with $T$ tending to infinity. If $\delta_T$ is the Bayes estimator with respect to (3.6) and $r_T$ 
its Bayes risk, the minimax character of $\delta^*$ will follow if it can be shown that $r_T$ 
tends to the constant risk $r^* = E_0\, \delta^{*2}(X)$ of $\delta^*$ as $T \to \infty$. Since $r_T \le r^*$ for all 
$T$, it is enough to show 

(3.7)    $\liminf r_T \ge r^*$. 

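We begin by establishing the lower bound for $r_T$ 

(3.8)    $r_T \ge (1 - \varepsilon) \inf_{a \le -\varepsilon T,\ b \ge \varepsilon T} E_0\, \delta_{a,b}^2(X),$

where $\varepsilon$ is any number between 0 and 1, and $\delta_{a,b}$ is the Bayes estimator with respect 
to the uniform prior on $(a, b)$ so that, in particular, $\delta_T = \delta_{-T,T}$. Then, for any $c$ 
(Problem 3.7), 

(3.9)    $\delta_{a,b}(x + c) = \delta_{a-c,\,b-c}(x) + c$

and hence 

    $E_\theta[\delta_{-T,T}(X) - \theta]^2 = E_0[\delta_{-T-\theta,\,T-\theta}(X)]^2.$

It follows that for any $0 < \varepsilon < 1$, 

    $r_T = \frac{1}{2T} \int_{-T}^{T} E_0[\delta_{-T-\theta,\,T-\theta}(X)]^2\, d\theta \ge (1 - \varepsilon) \inf_{|\theta| \le (1-\varepsilon)T} E_0[\delta_{-T-\theta,\,T-\theta}(X)]^2.$

Since $-T - \theta \le -\varepsilon T$ and $T - \theta \ge \varepsilon T$ when $|\theta| \le (1 - \varepsilon)T$, this implies (3.8). 
Next, we show that 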


(3.10)    $\liminf_{T \to \infty} r_T \ge E_0\left[\liminf_{a \to -\infty,\ b \to \infty} \delta_{a,b}^2(X)\right]$

where the lim inf on the right side is defined as the smallest limit point of all 
sequences $\delta_{a_n,b_n}^2(X)$ with $a_n \to -\infty$ and $b_n \to \infty$. To see this, note that for any 
function $h$ of two real arguments, one has (Problem 3.8) 

(3.11)    $\liminf_{T \to \infty} \inf_{a \le -T,\ b \ge T} h(a, b) = \liminf_{a \to -\infty,\ b \to \infty} h(a, b).$

Taking the lim inf of both sides of (3.8), and using (3.11) and Fatou's Lemma 
(Lemma 1.2.6) proves (3.10). 

We shall, finally, show that as $a \to -\infty$ and $b \to \infty$, 

(3.12)    $\delta_{a,b}(X) \to \delta^*(X)$ with probability 1. 

From this, it follows that the right side of (3.10) is $r^*$, which will complete the 
proof. The limit (3.12) is seen from the fact that 

    $\delta_{a,b}(x) = \int_a^b u f(x - u)\, du \Big/ \int_a^b f(x - u)\, du$

and that, by Problems 3.1.20 and 3.1.21, the set of points $x$ for which 

    $\int_{-\infty}^{\infty} f(x - u)\, du < \infty$ and $\int_{-\infty}^{\infty} |u| f(x - u)\, du < \infty$

has probability 1. □ 
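The two facts about $\delta_{a,b}$ used in the proof, the shift identity (3.9) and the convergence (3.12), can be illustrated numerically. The sketch below makes a simplifying assumption not made in the theorem: a single observation from $N(\theta, 1)$, for which the Pitman estimator is $\delta^*(x) = x$. The function name `delta_ab` and the midpoint-rule integration are choices of this illustration.

```python
import math


def delta_ab(x, a, b, steps=4000):
    """Bayes estimator for one N(theta, 1) observation x under the uniform
    prior on (a, b): int_a^b u f(x-u) du / int_a^b f(x-u) du (midpoint rule).
    The normal density is an assumption of this example."""
    h = (b - a) / steps
    num = den = 0.0
    for j in range(steps):
        u = a + (j + 0.5) * h
        w = math.exp(-0.5 * (x - u) ** 2)
        num += u * w
        den += w
    return num / den


# Shift identity (3.9): delta_{a,b}(x + c) = delta_{a-c,b-c}(x) + c.
x, c, a, b = 0.4, 1.7, -3.0, 5.0
lhs = delta_ab(x + c, a, b)
rhs = delta_ab(x, a - c, b - c) + c

# Convergence (3.12): with a wide prior the estimate approaches delta*(x) = x,
# whereas a narrow prior pulls the estimate into (a, b).
wide = delta_ab(0.7, -20.0, 20.0)
narrow = delta_ab(3.0, -1.0, 1.0)
```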

Theorem 3.5 is due to Girshick and Savage (1951), who proved it somewhat more 
generally without assuming a probability density and under the sole assumption 
that there exists an estimator (not necessarily equivariant) with finite risk. The 
streamlined proof given here is due to Peter Bickel. 

Of course, one would like to know whether the constant risk minimax estimator 
$\delta^*$ is admissible. This question was essentially settled by Stein (1959). We state 
without proof the following special case of his result. 

Theorem 3.6 If $X_1, \ldots, X_n$ are independently distributed with common probability 
density $f(x - \theta)$, and if there exists an equivariant estimator $\delta_0$ of $\theta$ for 
which $E_0|\delta_0(X)|^3 < \infty$, then the Pitman estimator $\delta^*$ is admissible under squared 
error loss. 

It was shown by Perng (1970) that this admissibility result need not hold when 
the third-moment condition is dropped. 

In Example 3.4, we have, so far, restricted attention to squared error loss. 
Admissibility of the MRE estimator has been proved for large classes of loss functions 
by Farrell (1964), Brown (1966), and Brown and Fox (1974b). A key assumption 
is the uniqueness of the MRE estimator. An early counterexample when that 
assumption does not hold was given by Blackwell (1951). A general inadmissibility 
result in the case of nonuniqueness is due to Farrell (1964). 

Examples 3.3 and 3.4 involved a single parameter $\theta$. That an MRE estimator of $\theta$ 
may be inadmissible in the presence of nuisance parameters, when the corresponding 
estimator of $\theta$ with known values of the nuisance parameters is admissible, is 
illustrated by the estimator (2.30). Other examples of this type have been studied 
by Brown (1968), Zidek (1973), and Berger (1976bc), among others. An important 
illustration of the inadmissibility of the MRE estimator of a vector-valued 
parameter constitutes the principal subject of the next two sections. 

Even when the best equivariant estimator is not admissible, it may still be— 
and frequently is—minimax. Conditions for an MRE estimator to be minimax are 
given by Kiefer (1957) or Robert (1994a, Section 7.5). (See Note 9.3.) The general 
treatment of admissibility and minimaxity of MRE estimators is beyond the scope 
of this book. However, roughly speaking, MRE estimators will typically not be 
admissible except in the simplest situations, but they have a much better chance 
of being minimax. 

The difference can be seen by comparing Example 1.14 and the proof of Theorem 3.5 
with the first admissibility proof of Example 2.8. If there exists an invariant 
measure over the parameter space of the group family (or equivalently over the 
group, see Section 3.2) which can be suitably approximated by a sequence of 
probability distributions, one may hope that the corresponding Bayes estimators 
will tend to the MRE estimator and Theorem 3.5 will become applicable. In comparison, 
the corresponding proof in Example 2.8 is much more delicate because 
it depends on the rate of convergence of the risks (this is well illustrated by the 
attempted admissibility proof at the beginning of the next section). 

As a contrast to Theorem 3.5, we shall now give some examples in which the 
MRE estimator is not minimax. 

Example 3.7 MRE not minimax. Consider once more the estimation of A in 
Example 3.2.12 with loss 1 when \d— A|/A > 1/2, and 0 otherwise. The problem 
remains invariant under the group G of transformations 

    $X_1' = a_1 X_1 + a_2 X_2$,    $Y_1' = c(a_1 Y_1 + a_2 Y_2)$, 

    $X_2' = b_1 X_1 + b_2 X_2$,    $Y_2' = c(b_1 Y_1 + b_2 Y_2)$ 

with $a_1 b_2 \ne a_2 b_1$ and $c > 0$. The only equivariant estimator is $\delta(x, y) \equiv 0$ and its 
risk is 1 for all values of A. On the other hand, the risk of the estimator k*Y 2 /X\ 
obtained in Example 3.2.12 is clearly less than 1. j 

Example 3.8 A random walk.³ Consider a walk in the plane. The walker at each 
step goes one unit either right, left, up, or down and these possibilities will be 
denoted by $a$, $a^-$, $b$, and $b^-$, respectively. Such a walk can be represented by a 
finite "path" such as 

    $b\, b\, a^- b^- a^- a^- a^- a^-$. 

In reporting a path, we shall, however, cancel any pair of successive steps which 
reverse each other, such as $a^- a$ or $b b^-$. The resulting set of all finite paths constitutes 
the parameter space $\Omega$. A typical element of $\Omega$ will be denoted by 

    $\theta = \pi_1 \cdots \pi_m$, 

its length by $l(\theta) = m$. Being a parameter, $\theta$ (as well as $m$) is assumed to be 
unknown. What is observed is the path $X$ obtained from $\theta$ by adding one more 
step, which is taken in one of the four possible directions at random, that is, with 
probability 1/4 each. If this last step is $\pi_{m+1}$, we have 

    $X = \theta \pi_{m+1}$ if $\pi_m$ and $\pi_{m+1}$ do not cancel each other, 
    $X = \pi_1 \cdots \pi_{m-1}$ otherwise. 

A special case occurs if $\theta$ or $X$, after cancellation, reduce to a path of length 0; 
this happens, for example, if $\theta = a^-$ and the random step leading to $X$ is $a$. The 
resulting path will then be denoted by $e$. 

The problem is to estimate $\theta$, having observed $X = x$; the loss will be 1 if the 
estimated path $\delta(x)$ is $\ne \theta$, and 0 if $\delta(x) = \theta$. 

If we observe $X$ to be 

    $x = \pi_1 \cdots \pi_k$, 

the natural estimate is 

    $\delta_0(x) = \pi_1 \cdots \pi_{k-1}$. 

An exception occurs when $x = e$. In that case, which can arise only when $l(\theta) = 1$, 
let us arbitrarily put $\delta_0(e) = a$. The estimator defined in this way clearly satisfies 

    $R(\theta, \delta_0) \le \frac{1}{4}$ for all $\theta$. 

³ A more formal description of this example is given in TSH2 [Chapter 1, Problem 11(ii)]. See also 
Diaconis (1988) for a general treatment of random walks on groups. 

Now, consider the transformations that modify the paths $\theta$, $x$, and $\delta(x)$ by having 
each preceded by an initial segment $\pi_{-r} \cdots \pi_{-1}$ on the left, so that, for example, 
$\theta = \pi_1 \cdots \pi_m$ is transformed into 

    $\bar{g}\theta = \pi_{-r} \cdots \pi_{-1} \pi_1 \cdots \pi_m$

where, of course, some cancellations may occur. The group $G$ is obtained by considering 
the addition in this manner of all possible initial path segments. Equivariance 
of an estimator $\delta$ under this group is expressed by the condition 

(3.13)    $\delta(\pi_{-r} \cdots \pi_{-1} x) = \pi_{-r} \cdots \pi_{-1} \delta(x)$

for all $x$ and all $\pi_{-r} \cdots \pi_{-1}$, $r = 1, 2, \ldots$. This implies, in particular, that 

(3.14)    $\delta(\pi_{-r} \cdots \pi_{-1}) = \pi_{-r} \cdots \pi_{-1} \delta(e)$, 

and this condition is sufficient as well as necessary for $\delta$ to be equivariant because 
(3.14) implies that 

    $\pi_{-r} \cdots \pi_{-1} \delta(x) = \pi_{-r} \cdots \pi_{-1} x \delta(e) = \delta(\pi_{-r} \cdots \pi_{-1} x)$. 

Since $\bar{G}$ is clearly transitive over $\Omega$, the risk function of any equivariant estimator 
is constant. Let us now determine the MRE estimator. Suppose that $\delta(e) = 
\pi_{10} \cdots \pi_{k0}$, so that by (3.14), 

    $\delta(x) = x \pi_{10} \cdots \pi_{k0}$. 

The only possibility of $\delta(x)$ being equal to $\theta$ occurs when $\pi_{10}$ cancels the last 
element of $x$. The best choice for $k$ is clearly $k = 1$, and the choice of $\pi_{10}$ (fixed 
or random) is then immaterial; in any case, the probability of cancellation with 
the last element of $X$ is 1/4, so that the risk of the MRE estimator (which is not 
unique) is 3/4. Comparison with $\delta_0$ shows that a best equivariant estimator in this 
case is not only not admissible but not even minimax. ||
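Because only four last steps are possible, the two risks in this example can be computed exactly by enumeration. The following sketch encodes paths as strings, with upper-case letters standing for the reversed steps $a^-$ and $b^-$; this encoding and all function names are assumptions made for the illustration.

```python
# Steps a, a^-, b, b^- are encoded as 'a', 'A', 'b', 'B' (hypothetical encoding).
INV = {"a": "A", "A": "a", "b": "B", "B": "b"}


def append_step(path, step):
    """Add one step on the right, cancelling it against the last step
    of the path if the two reverse each other."""
    if path and path[-1] == INV[step]:
        return path[:-1]
    return path + step


def delta0(x):
    """Natural estimator: drop the last step; delta0(e) = 'a' by convention."""
    return x[:-1] if x else "a"


def delta_equivariant(x, pi="a"):
    """Equivariant estimator with delta(e) = pi, so that delta(x) = x pi."""
    return append_step(x, pi)


def risk(estimator, theta):
    """Exact risk under 0-1 loss: average over the four equally likely steps."""
    return sum(estimator(append_step(theta, s)) != theta for s in INV) / 4.0


thetas = ["", "a", "A", "bbA", "ba", "bbABAAAA"]
risks_natural = [risk(delta0, t) for t in thetas]
risks_equivariant = [risk(delta_equivariant, t) for t in thetas]
```

For every reduced path tried here the natural estimator has risk at most 1/4, while the equivariant estimator has risk exactly 3/4.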

The following example, in which the MRE estimator is again not minimax but 
where G is simply the group of translations on the real line, is due to Blackwell 
and Girshick (1954). 

Example 3.9 Discrete location family. Let $X = U + \theta$ where $U$ takes on the 
values $1, 2, \ldots$ with probabilities 

    $P(U = k) = p_k$. 

We observe $x$ and wish to estimate $\theta$ with loss function 

(3.15)    $L(\theta, d) = d - \theta$ if $d > \theta$, 
          $L(\theta, d) = 0$ if $d \le \theta$. 

The problem remains invariant under arbitrary translation of $X$, $\theta$, and $d$ by the 
same amount. It follows from Section 3.1 that the only equivariant estimators are 
those of the form $X - c$. The risk of such an estimator, which is constant, is given 
by 

(3.16)    $\sum_{k > c} (k - c)\, p_k$. 

If the $p_k$ tend to 0 sufficiently slowly, an equivariant estimator will have infinite 
risk. This is the case, for example, when 

(3.17)    $p_k = \frac{1}{k(k + 1)}$

(Problem 3.11). The reason is that there is a relatively large probability of substantially 
overestimating $\theta$, for which there is a heavy penalty. This suggests a 
deliberate policy of grossly underestimating $\theta$, for which, by (3.15), there is no 
penalty. One possible such estimator (which, of course, is not equivariant) is 

(3.18)    $\delta(x) = x - M|x|$, $M > 1$, 

and it is not hard to show that its maximum risk is finite (Problem 3.12). 
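Both claims can be explored numerically for the probabilities (3.17). The sketch below computes partial sums of (3.16), which keep growing, and the exact risk of (3.18), which involves only finitely many nonzero terms for each $\theta$. The function names, the choice $M = 2$, and the grid of $\theta$ values are assumptions of this illustration; a finite grid of course does not replace the argument asked for in Problem 3.12.

```python
def p(k):
    """The probabilities (3.17): p_k = 1 / (k (k + 1))."""
    return 1.0 / (k * (k + 1))


def equivariant_partial_risk(c, n_terms):
    """Partial sum of (3.16) for the estimator X - c, using k <= n_terms."""
    return sum((k - c) * p(k) for k in range(1, n_terms + 1) if k > c)


def risk_shrunk(theta, M=2.0):
    """Exact risk of delta(x) = x - M|x| at theta under the loss (3.15).
    The loss vanishes once k >= M|theta|/(M - 1), so the sum is finite."""
    k_max = int(M * abs(theta) / (M - 1.0)) + 2
    total = 0.0
    for k in range(1, k_max + 1):
        x = theta + k
        total += max(x - M * abs(x) - theta, 0.0) * p(k)
    return total


# Partial sums of (3.16) grow roughly like log(n_terms).
growth = equivariant_partial_risk(0, 10**4) - equivariant_partial_risk(0, 10**2)

# The risk of (3.18) stays bounded over this grid (and is 0 for theta >= 0).
max_risk = max(risk_shrunk(-float(t)) for t in range(0, 300))
```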


The ideas of the present section have relevance beyond the transitive case for 
which they were discussed so far. If G is not transitive, we can no longer ask 
whether the uniform minimum risk equivariant (UMRE) estimator is minimax 
since a UMRE estimator will then typically not exist. Instead, we can ask whether 
there exists a minimax estimator which is equivariant. Similarly, the question of 
the admissibility of the UMRE estimator can be rephrased by asking whether an 
estimator which is admissible among equivariant estimators is also admissible 
within the class of all estimators. 

The conditions for affirmative answers to these two questions are essentially 
the same as in the transitive case. In particular, the answer to both questions is 
affirmative when $G$ is finite. A proof along the lines of Theorem 4.1 is possible 
but not very convenient because it would require a characterization of all admissible 
(within the class of equivariant estimators) equivariant estimators as Bayes 
solutions with respect to invariant prior distributions. Instead, we shall utilize the 
fact that for every estimator $\delta$, there exists an equivariant estimator whose average 
risk (to be defined below) is no worse than that of $\delta$. 

Let the elements of the finite group $G$ be $g_1, \ldots, g_N$ and consider the estimators 

(3.19)    $\delta_i(x) = g_i^{*-1} \delta(g_i x)$. 

When $\delta$ is equivariant, of course, $\delta_i(x) = \delta(x)$ for all $i$. Consider the randomized 
estimator $\delta^*$ for which 

(3.20)    $\delta^*(x) = \delta_i(x)$ with probability $1/N$ for each $i = 1, \ldots, N$, 

and, assuming the set $\mathcal{D}$ of possible decisions to be convex, the estimator 

(3.21)    $\delta^{**}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_i(x)$

which, for given $x$, is the expected value of $\delta^*(x)$. Then, $\delta^{**}(x)$ is equivariant, and 
so is $\delta^*(x)$ in the sense that $g^{*-1} \delta^*(gx)$ again is equal to $\delta_i(x)$ with probability 
$1/N$ for each $i$ (Problem 3.13). For these two estimators, it is easy to prove that 
(Problem 3.14): 

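(i) for any loss function $L$, 

(3.22)    $R(\theta, \delta^*) = \frac{1}{N} \sum_{i=1}^{N} R(\bar{g}_i \theta, \delta)$

and 

(ii) for any loss function $L(\theta, d)$ which is convex in $d$, 

(3.23)    $R(\theta, \delta^{**}) \le \frac{1}{N} \sum_{i=1}^{N} R(\bar{g}_i \theta, \delta)$. 

From (3.22) and (3.23), it follows immediately that 

    $\sup R(\theta, \delta^*) \le \sup R(\theta, \delta)$ and $\sup R(\theta, \delta^{**}) \le \sup R(\theta, \delta)$, 

which proves the existence of an equivariant minimax estimator provided a minimax 
estimator exists. 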
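For the two-element group of Example 3.2 (identity and sign change), the averaged estimator (3.21) takes a particularly simple form. The sketch below is an illustration under that assumption; `symmetrize` is a hypothetical helper name, and the two starting estimators are arbitrary non-equivariant choices.

```python
def symmetrize(delta):
    """Return delta** of (3.21) for G = {identity, sign change}:
    delta**(x) = (1/2) [delta(x) + g^{*-1} delta(g x)], with g x = -x
    and g^{*-1} d = -d."""
    def delta_star_star(xs):
        d_identity = delta(xs)                    # g_1 = identity
        d_flip = -delta([-x for x in xs])         # g_2 x = -x, g_2^{*-1} d = -d
        return 0.5 * (d_identity + d_flip)
    return delta_star_star


# Two estimators that are not equivariant under the sign change.
shifted_mean = lambda xs: sum(xs) / len(xs) + 1.0
largest = lambda xs: max(xs)

sym_mean = symmetrize(shifted_mean)   # recovers the sample mean
sym_max = symmetrize(largest)         # becomes the midrange (max + min) / 2

xs = [1.0, 2.0, 6.0]
flipped = [-x for x in xs]
```

Averaging removes the constant shift in the first case and turns the maximum into the midrange in the second; both results satisfy $\delta(-x) = -\delta(x)$.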

Suppose, next, that $\delta_0$ is admissible among all equivariant estimators. If $\delta_0$ is 
not admissible within the class of all estimators, it is dominated by some $\delta$. Let $\delta^*$ 
and $\delta^{**}$ be as above. Then, (3.22) and (3.23) imply that $\delta^*$ and $\delta^{**}$ dominate $\delta_0$, 
which is a contradiction. 

Of the two constructions, $\delta^{**}$ has the advantage of not requiring randomization, 
whereas $\delta^*$ has the advantage of greater generality since it does not require $L$ to 
be convex. Both constructions easily generalize to groups that admit an invariant 
measure which is finite (Problems 4.4.12-4.4.14). Further exploration of the 
relationship of equivariance to admissibility and the minimax property leads to the 
Hunt-Stein theorem (see Notes 9.3). 


4 Simultaneous Estimation 

So far, we have been concerned with the estimation of a single real-valued parameter 
$g(\theta)$. However, one may wish to estimate several parameters simultaneously, 
for example, several physiological constants of a patient, several quality characteristics 
of an industrial or agricultural product, or several dimensions of musical 
ability. One is then dealing with a vector-valued estimand 

    $g(\theta) = [g_1(\theta), \ldots, g_r(\theta)]$

and a vector-valued estimator 

    $\delta = (\delta_1, \ldots, \delta_r)$. 

A natural generalization of squared error as a measure of accuracy is 

(4.1)    $\sum [\delta_i - g_i(\theta)]^2$, 

a sum of squared error losses, which we shall often simply call squared error loss. 
More generally, we shall consider loss functions $L(\theta, \delta)$ where $\delta = (\delta_1, \ldots, \delta_r)$, 
and then denote the risk of an estimator $\delta$ by 

(4.2)    $R(\theta, \delta) = E_\theta L[\theta, \delta(X)]$. 

Another generalization of expected squared error loss is the matrix $\mathcal{R}(\theta, \delta)$ 
whose $(i, j)$th element is 

(4.3)    $E\{[\delta_i(X) - g_i(\theta)][\delta_j(X) - g_j(\theta)]\}$. 

We shall say that $\delta$ is more concentrated about $g(\theta)$ than $\delta'$ if 

(4.4)    $\mathcal{R}(\theta, \delta') - \mathcal{R}(\theta, \delta)$

is positive semidefinite (but not identically zero). This definition differs from that 
based on (4.2) by providing only a partial ordering of estimators, since (4.4) may 
be neither positive nor negative semidefinite. 

Lemma 4.1 

(i) $\delta$ is more concentrated about $g(\theta)$ than $\delta'$ if and only if 

(4.5)    $E\{\sum k_i [\delta_i(X) - g_i(\theta)]\}^2 \le E\{\sum k_i [\delta_i'(X) - g_i(\theta)]\}^2$

for all constants $k_1, \ldots, k_r$. 

(ii) In particular, if $\delta$ is more concentrated about $g(\theta)$ than $\delta'$, then 

(4.6)    $E[\delta_i(X) - g_i(\theta)]^2 \le E[\delta_i'(X) - g_i(\theta)]^2$ for all $i$. 

(iii) If $R(\theta, \delta) \le R(\theta, \delta')$ for all convex loss functions, then $\delta$ is more concentrated 
about $g(\theta)$ than $\delta'$. 

Proof. 

(i) If $E\{\sum k_i [\delta_i(X) - g_i(\theta)]\}^2$ is expressed as a quadratic form in the $k_i$, its matrix 
is $\mathcal{R}(\theta, \delta)$. 

(ii) This is a special case of (i). 

(iii) This follows from the fact that $\{\sum k_i [d_i - g_i(\theta)]\}^2$ is a convex function of 
$d = (d_1, \ldots, d_r)$. 

□ 

Let us now consider the extension of some of the earlier theory to the case of 
simultaneous estimation of several parameters. 

(1) The Rao-Blackwell theorem (Theorem 1.7.8). The proof of this theorem 
shows that its results remain valid when $\delta$ and $g$ are vector-valued. In particular, 
for any convex loss function, the risk of any estimator is reduced by taking its 
expectation given a sufficient statistic. It follows that for such loss functions, one 
can dispense with randomized estimators. Also, Lemma 4.1 shows that an estimator 
$\delta$ is always less concentrated about $g(\theta)$ than the expectation of $\delta(X)$, given a 
sufficient statistic. 

(2) Unbiased estimation. In the vector-valued case, an estimator $\delta$ of $g(\theta)$ is said 
to be unbiased if 

(4.7)    $E_\theta[\delta_i(X)] = g_i(\theta)$ for all $i$ and $\theta$. 

For unbiased estimators, the concentration matrix $\mathcal{R}$ defined by (4.3) is just the 
covariance matrix of $\delta$. 

From the Rao-Blackwell theorem, it follows, as in Theorem 2.1.11 for the case 
$r = 1$, that if $L$ is convex and if a complete sufficient statistic $T$ exists, then any 
U-estimable $g$ has a unique unbiased estimator depending only on $T$. This estimator 
uniformly minimizes the risk among all unbiased estimators and, thus, is also more 
concentrated about $g(\theta)$ than any other unbiased estimator. 

(3) Equivariant estimation. The definitions and concepts of Section 3.2 apply 
without changes. They are illustrated by the following example, which will be 
considered in more detail later in the section. 

Example 4.2 Several normal means. Let $X = (X_1, \ldots, X_r)$, with the $X_i$ independently 
distributed as $N(\theta_i, 1)$, and consider the problem of estimating the vector 
mean $\theta = (\theta_1, \ldots, \theta_r)$ with squared error loss. This problem remains invariant under 
the group $G_1$ of translations 

(4.8)    $gX = (X_1 + a_1, \ldots, X_r + a_r)$, 
         $\bar{g}\theta = (\theta_1 + a_1, \ldots, \theta_r + a_r)$, 
         $g^* d = (d_1 + a_1, \ldots, d_r + a_r)$. 

The only equivariant estimators are those of the form 

(4.9)    $\delta(X) = (X_1 + c_1, \ldots, X_r + c_r)$

and an easy generalization of Example 3.1.16 shows that $X$ is the MRE estimator 
of $\theta$. 

The problem also remains invariant under the group $G_2$ of orthogonal transformations 

(4.10)    $gX = X\Gamma$,    $\bar{g}\theta = \theta\Gamma$,    $g^* d = d\Gamma$

where $\Gamma$ is an orthogonal $r \times r$ matrix. An estimator $\delta$ is equivariant if and only if 
it is of the form (Problem 4.1) 

(4.11)    $\delta(X) = u(X) \cdot X$, 

where $u(X)$ is any scalar satisfying 

(4.12)    $u(X\Gamma) = u(X)$ for all orthogonal $\Gamma$ and all $X$ 

and, hence, is an arbitrary function of $\sum X_i^2$ (Problem 4.2). The group $\bar{G}_2$ defined 
by (4.10) is not transitive over the parameter space, and a UMRE estimator of $\theta$, 
therefore, cannot be expected. ||

(4) Bayes estimators. The following result frequently makes it possible to reduce 
Bayes estimation of a vector-valued estimand to that of its components. 

Lemma 4.3 Suppose that $\delta_i^*(X)$ is the Bayes estimator of $g_i(\theta)$ when $\theta$ has the 
prior distribution $\Lambda$ and the loss is squared error. Then, $\delta^* = (\delta_1^*, \ldots, \delta_r^*)$ is most 
concentrated about $g(\theta)$ in the Bayes sense that it minimizes 

(4.13)    $E[\sum k_i (\delta_i(X) - g_i(\theta))]^2 = E[\sum k_i \delta_i(X) - \sum k_i g_i(\theta)]^2$

for all $k_i$, where the expectation is taken over both $\theta$ and $X$. 

Proof. The result follows from the fact that the estimator $\sum k_i \delta_i(X)$ minimizing 
(4.13) is 

    $E[\sum k_i g_i(\theta) \mid X] = \sum k_i E[g_i(\theta) \mid X] = \sum k_i \delta_i^*(X)$. 

□ 


Example 4.4 Multinomial Bayes. Let $X = (X_0, \ldots, X_s)$ have the multinomial
distribution $M(n; p_0, \ldots, p_s)$, and consider the Bayes estimation of the vector
$p = (p_0, \ldots, p_s)$ when the prior distribution of $p$ is the Dirichlet distribution $\Lambda$
with density

(4.14)  $\frac{\Gamma(a_0 + \cdots + a_s)}{\Gamma(a_0) \cdots \Gamma(a_s)}\, p_0^{a_0 - 1} \cdots p_s^{a_s - 1}$   $(a_i > 0,\ 0 < p_i < 1,\ \sum p_i = 1)$.

The Bayes estimator of $p_i$ for squared error loss is (Problem 4.3)

(4.15)  $\delta_i(X) = \frac{a_i + X_i}{\sum a_j + n}$,

and by Lemma 4.3, the estimator $[\delta_0(X), \ldots, \delta_s(X)]$ is then most concentrated in
the Bayes sense. As a check, note that $\sum \delta_i(X) = 1$ as, of course, it must since $\Lambda$
assigns probability 1 to $\sum p_i = 1$. ||
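As a small illustration of (4.15), the sketch below computes the Dirichlet posterior means for one set of counts. The function name `dirichlet_bayes` and the example counts and prior parameters are assumptions chosen for illustration only.

```python
def dirichlet_bayes(counts, a):
    """Bayes estimate (4.15): (a_i + x_i) / (sum_j a_j + n) for each category."""
    if len(counts) != len(a):
        raise ValueError("counts and prior parameters must have the same length")
    n = sum(counts)
    total = sum(a) + n
    return [(ai + xi) / total for ai, xi in zip(a, counts)]

# Three categories, n = 10 observations, uniform Dirichlet prior a_i = 1.
est = dirichlet_bayes([3, 0, 7], [1.0, 1.0, 1.0])
```

The components of `est` sum to 1, in line with the check at the end of Example 4.4, and a category with zero count still receives a positive estimate.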


(5) Minimax estimators. In generalization of the binomial minimax problem
treated in Example 1.7, let us now determine the minimax estimator of $(p_0, \ldots, p_s)$
for the multinomial model of Example 4.4.

Example 4.5 Multinomial minimax. Suppose the loss function is
squared error. In light of Example 1.7, one might guess that a least favorable
distribution is the Dirichlet distribution (4.14) with $a_0 = \cdots = a_s = a$. The Bayes
estimator (4.15) reduces to

(4.16)  $\delta_i(X) = \frac{a + X_i}{(s + 1)a + n}$.

The estimator $\delta(X)$ with components (4.16) has constant risk over the support of
(4.14), provided $a = \sqrt{n}/(s + 1)$, and for this value of $a$, $\delta(X)$ is therefore minimax
by Corollary 1.5. [Various versions of this problem are discussed by Steinhaus
(1957), Trybula (1958), Rutkowska (1977), and Olkin and Sobel (1979).] ||
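The constant-risk claim can be verified directly. The sketch below is illustrative and not taken from the text: it evaluates the exact summed squared-error risk of (4.16) as variance plus squared bias for each component, using the value $a = \sqrt{n}/(s + 1)$ that makes the risk constant. The function name `risk` and the chosen $n$ and probability vectors are assumptions.

```python
import math

def risk(p, n, a):
    """Exact risk sum_i E(delta_i - p_i)^2 of the estimator (4.16).

    Each component contributes its variance n p_i (1 - p_i) / d^2 plus its
    squared bias (a - k a p_i)^2 / d^2, where k = s + 1 and d = k a + n.
    """
    k = len(p)
    d = k * a + n
    return sum(n * pi * (1 - pi) + (a - k * a * pi) ** 2 for pi in p) / d ** 2

n, k = 25, 3                      # k = s + 1 categories
a = math.sqrt(n) / k              # value of a giving constant risk
r_uniform = risk([1 / 3, 1 / 3, 1 / 3], n, a)
r_skewed = risk([0.9, 0.05, 0.05], n, a)
r_corner = risk([1.0, 0.0, 0.0], n, a)
constant = (k - 1) / (k * (1 + math.sqrt(n)) ** 2)   # s / ((s+1)(1+sqrt(n))^2)
```

All three risks coincide with `constant`; with any other value of `a` the risk would depend on $\sum p_i^2$.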


Example 4.6 Independent experiments. Suppose the components $X_i$ of $X =
(X_1, \ldots, X_r)$ are independently distributed according to distributions $P_{\theta_i}$, where
the $\theta_i$ vary independently over $\Omega_i$, so that the parameter space for $\theta = (\theta_1, \ldots, \theta_r)$
is $\Omega = \Omega_1 \times \cdots \times \Omega_r$. Suppose, further, that for the $i$th component problem of
estimating $\theta_i$ with squared error loss, $\Lambda_i$ is least favorable for $\theta_i$, and the minimax
estimator $\delta_i$ is the Bayes solution with respect to $\Lambda_i$, satisfying condition (1.5) with
$\omega_i = \omega_{\Lambda_i}$. Then, $\delta = (\delta_1, \ldots, \delta_r)$ is minimax for estimating $\theta$ with squared error
loss. This follows from the facts that (i) $\delta$ is a Bayes estimator with respect to the
prior distribution $\Lambda$ for $\theta$, according to which the components $\theta_i$ are independently
distributed with distribution $\Lambda_i$, (ii) $\Lambda(\omega) = 1$ where $\omega = \omega_1 \times \cdots \times \omega_r$, and (iii)
the set of points $\theta$ at which $R(\theta, \delta)$ attains its maximum is exactly $\omega$.

The analogous result holds if the component minimax estimators $\delta_i$ are not Bayes
solutions with respect to least favorable priors but have been obtained through
a least favorable sequence by Theorem 1.12. As an example, suppose that $X_i$
$(i = 1, \ldots, r)$ are independently distributed as $N(\theta_i, 1)$. Then, it follows that
$(X_1, \ldots, X_r)$ is minimax for estimating $(\theta_1, \ldots, \theta_r)$ with squared error loss. ||

The extensions so far have brought no great surprises. The results for general r 
were fairly straightforward generalizations of those for r = 1. This will no longer 
always be the case for the last topic to be considered. 


(6) Admissibility. The multinomial minimax estimator (4.16) was seen to be a
unique Bayes estimator and, hence, is admissible. To investigate the admissibility
of the minimax estimator $X$ for the case of $r$ normal means considered at the end
of Example 4.6, one might try the argument suggested following Theorem 4.1. It
was seen in Example 4.2 that the problem under consideration remains invariant
under the group $G_1$ of translations and the group $G_2$ of orthogonal transformations,
given by (4.8) and (4.10), respectively. Of these, $G_1$ is transitive; if there existed
an invariant probability distribution over $G_1$, the remark following Theorem 4.1
would lead to an admissible estimator, hopefully $X$. However, the measures $c\nu$,
where $\nu$ is Lebesgue measure, are the only invariant measures (Problem 4.14) and
they are not finite. Let us instead consider $G_2$. An invariant probability distribution
over $G_2$ does exist (TSH2, Example 6 of Chapter 9). However, the approach now
fails because $G_2$ is not transitive. Equivariant estimators do not necessarily have
constant risk and, in fact, in the present case, a UMRE estimator does not exist
(Strawderman 1971).

Since neither of these two attempts works, let us try the limiting Bayes method
(Example 2.8, first proof) instead, which was successful in the case $r = 1$. For the
sake of convenience, we shall take the loss to be the average squared error,

(4.17)  $L(\theta, d) = \frac{1}{r}\sum (d_i - \theta_i)^2$.

If $X$ is not admissible, there exists an estimator $\delta^*$, a number $\varepsilon > 0$, and intervals
$(\theta_{i0}, \theta_{i1})$ such that

$R(\theta, \delta^*) \le 1$ for all $\theta$,
$R(\theta, \delta^*) < 1 - \varepsilon$ for $\theta$ satisfying $\theta_{i0} < \theta_i < \theta_{i1}$ for all $i$.

A computation analogous to that of Example 2.8 now shows that

(4.18)  $\frac{1 - r_\tau^*}{1 - r_\tau} \ge \frac{\varepsilon(1 + \tau^2)}{(\sqrt{2\pi}\,\tau)^r} \int_{\theta_{10}}^{\theta_{11}} \cdots \int_{\theta_{r0}}^{\theta_{r1}} \exp\left(-\sum \theta_i^2 / 2\tau^2\right) d\theta_1 \cdots d\theta_r$.

Unfortunately, the factor preceding the integral no longer tends to infinity when
$r > 1$, and so this proof breaks down too.
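To see where the argument fails, it may help to look at the limiting behavior of the bound as $\tau \to \infty$ with the intervals held fixed. The following is a sketch under that assumption, with $V$ (a symbol introduced here for illustration) denoting the volume of the box $\prod_i (\theta_{i0}, \theta_{i1})$.

```latex
% On the fixed box the integrand exp(-\sum\theta_i^2 / 2\tau^2) tends to 1,
% so the integral tends to the volume V = \prod_i (\theta_{i1} - \theta_{i0}).
\frac{\varepsilon(1+\tau^2)}{(\sqrt{2\pi}\,\tau)^r}
\int\!\cdots\!\int \exp\!\Big(-\frac{1}{2\tau^2}\sum\theta_i^2\Big)\,
d\theta_1\cdots d\theta_r
\;\sim\; \frac{\varepsilon V}{(2\pi)^{r/2}}\,\tau^{2-r}
\qquad (\tau\to\infty).
% The right side diverges for r = 1, tends to the constant \varepsilon V/(2\pi)
% for r = 2, and tends to 0 for r \ge 3.
```

Only for $r = 1$ does the lower bound grow without limit, which is what the contradiction in the one-dimensional proof requires.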

It was shown by Stein (1956b) that $X$ is, in fact, no longer admissible when
$r \ge 3$ although admissibility continues to hold for $r = 2$. (A limiting Bayes proof
will work for $r = 2$, although not with normal priors. See Problem 4.5.) For $r \ge 3$,
there are many different estimators whose risk is uniformly less than that of $X$.

To produce an improved estimator, Stein (1956b) gave a "large $r$ and $|\theta|$"
argument based on the observation that with high probability, the true $\theta$ is in the
sphere $\{\theta : |\theta|^2 \le |x|^2 - r\}$. Since the usual estimator $X$ is approximately the
same size as $\theta$, it will almost certainly be outside of this sphere. Thus, we should
cut down the estimator $X$ to bring it inside the sphere. Stein argues that $X$ should
be cut down by a factor of $(|X|^2 - r)/|X|^2 = 1 - r/|X|^2$, and as a more general
form, he considers the class of estimators

(4.19)  $\delta(x) = [1 - h(|x|^2)]\,x$,

with particular emphasis on the special case

(4.20) *»=(l-^p)>. 

See Problem 4.6 for details. 

Later, James and Stein (1961) established the complete dominance of (4.20) over 
X, and (4.20) remains the basic underlying form of almost all improved estimators. 
In particular, the appearance of the squared term in the shrinkage factor is essential 
for optimality (Brown 1971; Berger 1976a; Berger 1985, Section 8.9.4). 

Since Stein (1956b) and James and Stein (1961), the proof of domination of the 
estimator (4.20) over the maximum likelihood estimator, X, has undergone many 
modifications and updates. More recent proofs are based on the representation of 
Corollary 4.7.2 and can be made to apply to cases other than the normal. We defer 
treatment of this topic until Section 5.6. At present, we only make some remarks 
about the estimator (4.20) and the following modifications due to James and Stein 
(1961). Let 

(4.21)  $\delta_i = \mu_i + \left(1 - \frac{r - 2}{|x - \mu|^2}\right)(x_i - \mu_i)$

where $\mu = (\mu_1, \ldots, \mu_r)$ are given numbers and

(4.22)  $|x - \mu| = \left[\sum (x_i - \mu_i)^2\right]^{1/2}$.

A motivation for the general structure of the estimator (4.21) can be obtained by
using arguments similar to the empirical Bayes arguments in Examples 4.7.7 and
4.7.8 (see also Problems 4.7.6 and 4.7.7). Suppose, a priori, it was thought likely,
though not certain, that $\theta_i = \mu_i$ $(i = 1, \ldots, r)$. Then, it might be reasonable first
to test

$H : \theta_1 = \mu_1, \ldots, \theta_r = \mu_r$

and to estimate $\theta$ by $\mu$ when $H$ is accepted and by $X$ otherwise. The best acceptance
region has the form $|x - \mu| \le C$ so that the estimator becomes

(4.23)  $\delta = \mu$ if $|x - \mu| \le C$,
        $\delta = x$ if $|x - \mu| > C$.

A smoother approach is provided by an estimator with components of the form

(4.24)  $\delta_i = \psi(|x - \mu|)\,x_i + [1 - \psi(|x - \mu|)]\,\mu_i$

where $\psi$, instead of being two-valued as in (4.23), is a function increasing
continuously with $\psi(0) = 0$ to $\psi(\infty) = 1$. The estimator (4.21) is of the form (4.24)
(although with $\psi(0) = -\infty$), but the argument given above provides no
explanation of the particular choice for $\psi$. We note, however, that many hierarchical
Bayes estimators (such as given in Example 5.2) will result in estimators of this
form. We will return to this question in Section 5.6.

For the case of unknown $\sigma$, the estimator corresponding to (4.23) has been
investigated by Sclove, Morris, and Radhakrishnan (1972). They show that it does
not provide a uniform improvement over $X$ and that its risk is uniformly greater
than that of the corresponding James-Stein estimator. Although these so-called
pretest estimators tend not to be optimal, they have been the subject of considerable
research (see, for example, Sen and Saleh 1985, 1987).

Unlike $X$, the estimator $\delta$ is, of course, biased. An aspect that in some
circumstances is disconcerting is the fact that the estimator of $\theta_i$ depends not only on $X_i$
but also on the other (independent) $X$'s. Do we save enough in risk to make up for
these drawbacks? To answer this, we take a closer look at the risk function.

Under the loss (4.17), it will be shown in Theorem 5.1 that the risk function of
the estimator (4.21) can be written as

(4.25)  $R(\theta, \delta) = 1 - \frac{r - 2}{r}\, E_\theta\left(\frac{r - 2}{|X - \mu|^2}\right)$.

Thus, $\delta$ has uniformly smaller risk than the constant-risk estimator $X$ when $r \ge 3$,
and, in particular, $\delta$ is then minimax by Example 4.6. More detailed information
can be obtained from the fact that $|X - \mu|^2$ has a noncentral $\chi^2$-distribution with
noncentrality parameter $\lambda = \sum (\theta_i - \mu_i)^2$ and that, therefore, the risk function
(4.25) is an increasing function of $\lambda$. (See TSH2 Chapter 3, Lemma 2 and Chapter
7, Problem 4 for details.) The risk function tends to 1 as $\lambda \to \infty$, and takes on
its minimum value at $\lambda = 0$. For this value, $|X - \mu|^2$ has a $\chi^2$-distribution with $r$
degrees of freedom, and it follows from Example 2.1 that (Problem 4.7)

$E\left(\frac{1}{|X - \mu|^2}\right) = \frac{1}{r - 2}$

and hence $R(\mu, \delta) = 2/r$. Particularly for large values of $r$, the savings over the
risk of $X$ (which is equal to 1 for all $\theta$) can therefore be substantial. (See Bondar
1987 for further discussion.)
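The size of these savings can be illustrated by simulation. The following Monte Carlo sketch is not from the text: the function names, the dimension $r = 10$, the replication count, and the seed are all illustrative assumptions. It estimates the risks of $X$ and of the estimator (4.21) under the loss (4.17), once at $\theta = \mu$ and once far from $\mu$.

```python
import random

def james_stein(x, mu):
    """Estimator (4.21): shrink x toward mu by the factor 1 - (r - 2)/|x - mu|^2."""
    r = len(x)
    dist2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    factor = 1.0 - (r - 2) / dist2
    return [mi + factor * (xi - mi) for xi, mi in zip(x, mu)]

def average_loss(est, theta):
    """Average squared error loss (4.17)."""
    return sum((e - t) ** 2 for e, t in zip(est, theta)) / len(theta)

def simulate_risk(theta, mu, reps=20000, seed=0):
    """Monte Carlo risk estimates for X and for the James-Stein estimator."""
    rng = random.Random(seed)
    total_x = total_js = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        total_x += average_loss(x, theta)
        total_js += average_loss(james_stein(x, mu), theta)
    return total_x / reps, total_js / reps

r = 10
mu = [0.0] * r
risk_x_at_mu, risk_js_at_mu = simulate_risk(mu, mu)      # theory: 1 and 2/r
risk_x_far, risk_js_far = simulate_risk([5.0] * r, mu)   # noncentrality 250
```

At $\theta = \mu$ the simulated James-Stein risk should be near $2/r = 0.2$, while far from $\mu$ it approaches, but stays below, the risk 1 of $X$.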

We thus have the surprising result that $X$ is not only inadmissible when $r \ge 3$
but that even substantial risk savings are possible. This is the case not only for
squared error loss but also for a wide variety of loss functions which in a suitable
way combine the losses resulting from the $r$ component problems. In particular,
Brown (1966, Theorem 3.1.1) proves that $X$ is inadmissible for $r \ge 3$ when
$L(\theta, d) = \rho(d - \theta)$, where $\rho$ is a convex function satisfying, in addition to some
mild conditions, the requirement that the $r \times r$ matrix $\mathcal{R}$ with the $(i, j)$th element

(4.26)  $E_0\left[X_i \frac{\partial}{\partial X_j}\rho(X)\right]$

is nonsingular. Here, the derivative in (4.26) is replaced by zero whenever it does
not exist.

Example 4.7 A variety of loss functions. Consider the following loss functions
$\rho_1, \ldots, \rho_4$:

$\rho_1(t) = \sum v_i t_i^2$  (all $v_i > 0$);
$\rho_2(t) = \max_i t_i^2$;
$\rho_3(t) = t_1^2$;
$\rho_4(t) = \left(\frac{1}{r}\sum t_i\right)^2$.

All four are convex, and $\mathcal{R}$ is nonsingular for $\rho_1$ and $\rho_2$ but singular for $\rho_3$ and $\rho_4$
(Problem 4.8). For $r \ge 3$, it follows from Brown's theorem that $X$ is inadmissible
for $\rho_1$ and $\rho_2$. On the other hand, it is admissible for $\rho_3$ and $\rho_4$ (Problem 4.10). ||
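As a hedged illustration of the nonsingularity condition (a sketch for two of the four cases, not the full solution to Problem 4.8), take the $(i, j)$th element of the matrix to be $E_0[X_i\,\partial\rho(X)/\partial X_j]$ and use the fact that under $\theta = 0$, $E_0 X_i X_j$ equals 1 if $i = j$ and 0 otherwise.

```latex
\rho_1(t)=\sum_k v_k t_k^2:\qquad
\mathcal{R}_{ij}=E_0\big[X_i\cdot 2v_jX_j\big]=2v_j\,\mathbf{1}\{i=j\},
\qquad \det\mathcal{R}=2^r\prod_j v_j>0;
\\[6pt]
\rho_3(t)=t_1^2:\qquad
\mathcal{R}_{ij}=E_0\big[X_i\cdot 2X_1\big]\,\mathbf{1}\{j=1\}
=2\,\mathbf{1}\{i=j=1\},
\qquad \det\mathcal{R}=0\quad (r\ge 2).
```

For the weighted quadratic loss the matrix is diagonal with positive entries, whereas a loss that involves only the first coordinate leaves every other row and column equal to zero.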

Other ways in which the admissibility of $X$ depends on the loss function are
indicated by the following example (Brown, 1980b) in which $L(\theta, d)$ is not of the
form $\rho(d - \theta)$.

Example 4.8 Admissibility of X. Let $X_i$ $(i = 1, \ldots, r)$ be independently
distributed as $N(\theta_i, 1)$ and consider the estimation of $\theta$ with loss function

(4.27)  $L(\theta, d) = \sum_i \frac{v(\theta_i)}{\sum_j v(\theta_j)}\,(\theta_i - d_i)^2$.

Then the following results hold:

(i) When $v(t) = e^{kt}$ $(k \ne 0)$, $X$ is inadmissible if and only if $r \ge 2$.

(ii) When $v(t) = (1 + t^2)^{k/2}$,

(a) $X$ is admissible for $k < 1$, $1 \le r < (2 - k)/(1 - k)$ and for $k \ge 1$, all $r$;

(b) $X$ is inadmissible for $k < 1$, $r > (2 - k)/(1 - k)$.

Parts (i) and (ii(b)) will not be proved here. For the proof of (ii(a)), see Problem
4.11. ||

In the formulations considered so far, the loss function in some way combines
the losses resulting from the different component problems. Suppose, however, that
the problems of estimating $\theta_1, \ldots, \theta_r$ are quite unrelated and that it is important
to control the error on each of them. It might then be of interest to minimize

(4.28)  $\max_i \sup_{\theta_i} E(\delta_i - \theta_i)^2$.

It is easy to see that $X$ is the unique estimator minimizing (4.28) and is admissible
from this point of view. This follows from the fact that $X_i$ is the unique estimator
for which

$\sup_{\theta_i} E(\delta_i - \theta_i)^2 \le 1$.

[On the other hand, it follows from Example 4.7 that $X$ is inadmissible for $r \ge 3$
when $L(\theta, d) = \max_i (\theta_i - d_i)^2$.]

The performance measure (4.28) is not a risk function in the sense defined in
Chapter 1 because it is not the expected value of some loss but the maximum of a
number of such expectations. An interesting way of looking at such a criterion was
proposed by Brown (1975) [see also Bock 1975, Shinozaki 1980, 1984]. Brown
considers a family $\mathcal{L}$ of loss functions $L$, with the thought that it is not clear which
of these loss functions will be most appropriate. (It may not be clear how the data
will be used, or they may be destined for multiple uses. In this connection, see also
Rao 1977.) If

(4.29)  $R_L(\theta, \delta) = E_\theta L[\theta, \delta(X)]$,

Brown defines $\delta$ to be admissible with respect to the class $\mathcal{L}$ if there exists no $\delta'$
such that

$R_L(\theta, \delta') \le R_L(\theta, \delta)$ for all $L \in \mathcal{L}$ and all $\theta$

with strict inequality holding for at least one $L = L_0$ and $\theta = \theta_0$.

The argument following (4.28) shows that $X$ is admissible when $\mathcal{L}$ contains the
$r$ loss functions $L_i(\theta, d) = (d_i - \theta_i)^2$, $i = 1, \ldots, r$, and hence, in particular, when
$\mathcal{L}$ is the class of all loss functions

(4.30)  $\sum_{i=1}^{r} c_i(\delta_i - \theta_i)^2$,  $0 \le c_i < \infty$.

On the other hand, Brown shows that if the ratios of the weights $c_i$ to each other
are bounded,

(4.31)  $c_i/c_j < K$,  $i, j = 1, \ldots, r$,

then no matter how large $K$, the estimator $X$ is inadmissible with respect to the
class $\mathcal{L}$ of loss functions (4.30) satisfying (4.31). Similar results persist in even
more general settings, such as when $\mathcal{L}$ is not restricted to squared error loss. See
Hwang 1985, Brown and Hwang 1989, and Problem 4.14.

The above considerations make it clear that the choice between X and competi¬ 
tors such as (4.21) must depend on the circumstances. (In this connection, see also 
Robinson 1979a, 1979b). A more detailed discussion of some of these issues will 
be given in the next section. 

5 Shrinkage Estimators in the Normal Case 

The simultaneous consideration of a number of similar estimation problems
involving independent variables and parameters $(X_i, \theta_i)$ often occurs in repetitive
situations in which it may be reasonable to view the $\theta$'s themselves as random
variables. This leads to the Bayesian approach of Examples 4.7.1, 4.7.7, and 4.7.8.
In the simplest normal case, we assume, as in Example 4.7.1, that the $X_i$ are
independent normal with mean $\theta_i$ and variance $\sigma^2$, and that the $\theta_i$'s are also normal, say
with mean $\xi$ and variance $A$ (previously denoted by $\tau^2$), that is, $X \sim N_r(\theta, \sigma^2 I)$
and $\theta \sim N_r(\xi, AI)$. This model has some similarity with the Model II version of
the one-way layout considered in Section 3.5. There, however, interest centered
on the variances $\sigma^2$ and $A$, while we now wish to estimate the $\theta_i$'s.

To simplify the problem, we shall begin by assuming that $\sigma$ and $\xi$ are known,
say $\sigma = 1$ and $\xi = 0$, so that only $A$ and $\theta$ are unknown. The empirical Bayes
arguments of Example 4.7.1 led to the estimator

(5.1)  $\delta_i = (1 - B)\,x_i$,

where $B = (r - 2)/\sum x_j^2$, that is, the James-Stein estimator (4.21) with $\mu = 0$. We
now prove that, as previously claimed, the risk function of (5.1) is given by (4.25).
However, we shall do so for the more general estimator (5.1) with $B = c(r - 2)/\sum x_j^2$
where $c$ is a positive constant. (The value $c = 1$ minimizes both the Bayes risk
(Problem 4.7.5) and the frequentist risk among estimators of this form.)

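Theorem 5.1 Let $X_i$, $i = 1, \ldots, r$ $(r > 2)$, be independent, with distributions
$N(\theta_i, 1)$ and let the estimator $\delta_c$ of $\theta$ be given by

(5.2)  $\delta_c(x) = \left(1 - c\,\frac{r - 2}{|x|^2}\right)x$,  $|x|^2 = \sum x_j^2$.

Then, the risk function of $\delta_c$, with loss function (4.17), is

(5.3)  $R(\theta, \delta_c) = 1 - \frac{(r - 2)^2}{r}\, E_\theta\left[\frac{c(2 - c)}{|X|^2}\right]$.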

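Proof. From Theorem 4.7.2, using the loss function (4.17), the risk of $\delta_c$ is

(5.4)  $R(\theta, \delta_c) = 1 + \frac{1}{r} E_\theta|g(X)|^2 - \frac{2}{r}\sum_{i=1}^{r} E_\theta \frac{\partial}{\partial X_i} g_i(X)$

where $g_i(x) = c(r - 2)x_i/|x|^2$ and $|g(x)|^2 = c^2(r - 2)^2/|x|^2$. Differentiation shows

$\frac{\partial}{\partial x_i} g_i(x) = \frac{c(r - 2)}{|x|^4}\left[|x|^2 - 2x_i^2\right]$

and hence

$\sum_{i=1}^{r} \frac{\partial}{\partial x_i} g_i(x) = \frac{c(r - 2)}{|x|^4}\sum_{i=1}^{r}\left[|x|^2 - 2x_i^2\right] = \frac{c(r - 2)}{|x|^2}(r - 2)$,

and substitution into (5.4) gives

$R(\theta, \delta_c) = 1 + \frac{1}{r} E_\theta\left[\frac{c^2(r - 2)^2}{|X|^2}\right] - \frac{2}{r} E_\theta\left[\frac{c(r - 2)^2}{|X|^2}\right]
= 1 - \frac{(r - 2)^2}{r}\, E_\theta\left[\frac{c(2 - c)}{|X|^2}\right]$.

□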

Note that

$E_\theta \frac{1}{|X|^2} \le E_0 \frac{1}{|X|^2}$,

so $R(\theta, \delta_c) < \infty$ only if the latter expectation is finite, which occurs when $r \ge 3$
(see Problem 5.2).

From the expression (5.3) for the risk, we immediately get the following results.

Corollary 5.2 The estimator $\delta_c$ defined by (5.2) dominates $X$ ($\delta_c = X$ when $c = 0$),
provided $0 < c < 2$ and $r \ge 3$.

Proof. For these values, $c(2 - c) > 0$ and, hence, $R(\theta, \delta_c) < 1$ for all $\theta$. Note that
$R(\theta, \delta_c) = R(\theta, X)$ for $c = 2$. □

Corollary 5.3 The James-Stein estimator $\delta$, which equals $\delta_c$ with $c = 1$, dominates
all estimators $\delta_c$ with $c \ne 1$.

Proof. The factor $c(2 - c)$ takes on its maximum value 1 if and only if $c = 1$. □

For $c = 1$, formula (5.3) verifies the risk formula (4.25). Since the James-Stein
estimator dominates all estimators $\delta_c$ with $c \ne 1$, one might hope that it
is admissible. However, unfortunately, this is not the case, as is shown by the
following theorem, which strengthens Theorem 4.7.5 by extending the comparison
from the average (Bayes) risk to the risk function.
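The role of the factor $c(2 - c)$ is easy to tabulate at $\theta = 0$, where $E_0(1/|X|^2) = 1/(r - 2)$ and the risk of the estimator with multiplier $c$ therefore reduces to $1 - \frac{r-2}{r}c(2 - c)$. The sketch below is illustrative only; the function name `risk_at_origin` and the grid of $c$ values are assumptions.

```python
def risk_at_origin(c, r):
    """Risk at theta = 0 of the estimator with shrinkage multiplier c,
    using E_0(1/|X|^2) = 1/(r - 2), which is finite only for r >= 3."""
    if r < 3:
        raise ValueError("the expectation is finite only for r >= 3")
    return 1.0 - (r - 2) / r * c * (2.0 - c)

r = 10
risks = {c: risk_at_origin(c, r) for c in (0.0, 0.5, 1.0, 1.5, 2.0)}
```

The risk equals 1 at $c = 0$ and $c = 2$, is symmetric about $c = 1$, and attains its minimum $2/r$ at $c = 1$, matching the two corollaries.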

Theorem 5.4 Let $\delta$ be any estimator of the form (5.1) with $B$ any strictly
decreasing function of the $x_i^2$'s and suppose that

(5.7)  $P_\theta(B > 1) > 0$.

Then,

$R(\theta, \hat{\delta}) < R(\theta, \delta)$

where

(5.8)  $\hat{\delta}_i = \max[(1 - B), 0]\,x_i$.

Proof. By (4.17),

$R(\theta, \delta) - R(\theta, \hat{\delta}) = \frac{1}{r}\sum\left[E_\theta\left(\delta_i^2 - \hat{\delta}_i^2\right) - 2\theta_i E_\theta\left(\delta_i - \hat{\delta}_i\right)\right]$.

To show that the expression in brackets is always $> 0$, calculate the expectations
by first conditioning on $B$. For any value $B \le 1$, we have $\delta_i = \hat{\delta}_i$, so it is enough
to show that the right side is positive when conditioned on any value $B = b > 1$.
Since in that case $\hat{\delta}_i = 0$, it is finally enough to show that for any $b > 1$,

$\theta_i E_\theta[\delta_i \mid B = b] = \theta_i(1 - b)E_\theta(X_i \mid B = b) \le 0$

and hence that $\theta_i E_\theta(X_i \mid B = b) \ge 0$. Now $B = b$ is equivalent to $|X|^2 = c$ for some
$c$ and hence to $X_1^2 = c - (X_2^2 + \cdots + X_r^2)$. Conditioning further on $X_2, \ldots, X_r$, we
find that

$E_\theta(\theta_1 X_1 \mid |X|^2 = c, x_2, \ldots, x_r) = \theta_1 E_\theta(X_1 \mid X_1^2 = y^2)
= \frac{\theta_1 y\left(e^{\theta_1 y} - e^{-\theta_1 y}\right)}{e^{\theta_1 y} + e^{-\theta_1 y}}$

where $y = \sqrt{c - (x_2^2 + \cdots + x_r^2)}$. This is an increasing function of $|\theta_1 y|$, which is
zero when $\theta_1 y = 0$, and this completes the proof. □

Theorem 5.4 shows in particular that the James-Stein estimator ($\delta_c$ with $c = 1$)
is dominated by another minimax estimator,

(5.9)  $\delta_i^+ = \left(1 - \frac{r - 2}{|x|^2}\right)^+ x_i$

where $(\cdot)^+$ indicates that the quantity in parentheses is replaced by 0 whenever it
is negative. We shall call

$(\cdot)^+ = \max[(\cdot), 0]$

the positive part of $(\cdot)$. The risk functions of the ordinary and positive-part Stein
estimators are shown in Figure 5.1.


Figure 5.1. Risk functions of the ordinary and positive-part Stein estimators, for r=4. 
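The comparison of the ordinary and positive-part Stein estimators can also be explored by simulation. The sketch below is illustrative and not from the text. It uses $r = 6$ rather than the $r = 4$ of the figure so that the Monte Carlo average has finite variance at $\theta = 0$, and the function names, replication count, and seed are assumptions.

```python
import random

def stein_pair(x):
    """Return the ordinary and the positive-part Stein estimates (shrinking toward 0)."""
    r = len(x)
    factor = 1.0 - (r - 2) / sum(v * v for v in x)
    ordinary = [factor * v for v in x]
    positive = [max(factor, 0.0) * v for v in x]
    return ordinary, positive

def compare_risks(theta, reps=20000, seed=1):
    """Monte Carlo risks of both estimators under average squared error loss,
    computed on the same simulated samples."""
    rng = random.Random(seed)
    r = len(theta)
    tot_ord = tot_pos = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]
        ordinary, positive = stein_pair(x)
        tot_ord += sum((d - t) ** 2 for d, t in zip(ordinary, theta)) / r
        tot_pos += sum((d - t) ** 2 for d, t in zip(positive, theta)) / r
    return tot_ord / reps, tot_pos / reps

risk_ord, risk_pos = compare_risks([0.0] * 6)
```

At $\theta = 0$ the positive-part rule never does worse on any sample, since a negative shrinkage factor moves the estimate past the origin, so its simulated risk falls below that of the ordinary estimator.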



Unfortunately, it can be shown that even $\delta^+$ is inadmissible because it is not
smooth enough to be either Bayes or limiting Bayes, as we will see in Section
5.7. However, as suggested by Efron and Morris (1973a, Section 5), the positive-part
estimator $\delta^+$ is difficult to dominate, and it took another twenty years until a
dominating estimator was found by Shao and Strawderman (1994). There exist, in
fact, many admissible minimax estimators, but they are of a more general form than
(5.1) or (5.9). To obtain such an estimator, we state the following generalization
of Corollary 5.2, due to Baranchik (1970) (see also Strawderman 1971, and Efron
and Morris 1976a). The proof is left to Problem 5.4.

Theorem 5.5 For $X \sim N_r(\theta, I)$, $r \ge 3$, and loss $L(\theta, d) = \frac{1}{r}\sum (d_i - \theta_i)^2$, an
estimator of the form

(5.10)  $\delta_i = \left(1 - c(|x|)\,\frac{r - 2}{|x|^2}\right)x_i$

is minimax provided

(i) $0 \le c(\cdot) \le 2$ and

(ii) the function $c$ is nondecreasing.


It is interesting to note how very different the situation for $r \ge 3$ is from
the one-dimensional problem discussed in Sections 5.2 and 5.3. There, minimax
estimators were unique (although recall Example 2.9); here, they constitute a rich
collection. It follows from Theorem 5.4 that the estimators (5.10) are inadmissible
whenever $c(|x|)/|x|^2 > 1/(r - 2)$ with positive probability. On the other hand, the
family (5.10) does contain some admissible members, as is shown by the following
example, due to Strawderman (1971).

Example 5.6 Proper Bayes minimax. Let $X_i$ be independent normal with mean
$\theta_i$ and unit variance, and suppose that the $\theta_i$'s are themselves random variables
with the following two-stage prior distribution. For a fixed value of $\lambda$, let the $\theta_i$ be
iid according to $N[0, \lambda^{-1}(1 - \lambda)]$. In addition, suppose that $\lambda$ itself is a random
variable, $\Lambda$, with distribution $\Lambda \sim (1 - a)\lambda^{-a}$, $0 \le a < 1$. We therefore have the
hierarchical model

        $X \sim N_r(\theta, I)$,
(5.11)  $\theta \sim N_r(0, \lambda^{-1}(1 - \lambda)I)$,
        $\Lambda \sim (1 - a)\lambda^{-a}$,  $0 < \lambda < 1$,  $0 \le a < 1$.

Here, for illustration, we take $a = 0$ so that $\Lambda$ has the uniform distribution $U(0, 1)$.

A straightforward calculation (Problem 5.5) shows that the Bayes estimator $\delta$,
under squared error loss (4.17), is given by (5.10) with

(5.12)  $c(|x|) = \frac{1}{r - 2}\left[r + 2 - \frac{2\exp(-\frac{1}{2}|x|^2)}{\int_0^1 \lambda^{r/2}\exp(-\lambda|x|^2/2)\,d\lambda}\right]$.

It follows from Problem 5.4 that $E(\Lambda \mid x) = (r - 2)c(|x|)/|x|^2$ and hence that
$c(|x|) \ge 0$ since $\lambda < 1$. On the other hand, $c(|x|) \le (r + 2)/(r - 2)$ and hence
$c(|x|) \le 2$ provided $r \ge 6$. It remains to show that $c(|x|)$ is nondecreasing or,
equivalently, that

$\int_0^1 \lambda^{r/2}\exp\left[\frac{1}{2}|x|^2(1 - \lambda)\right]d\lambda$

is nondecreasing in $|x|$. This is obvious since $0 < \lambda < 1$.

Thus, the estimator (5.10) with $c(|x|)$ given by (5.12) is a proper Bayes (and
admissible) minimax estimator for $r \ge 6$. ||
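The two conditions of Theorem 5.5 can be checked numerically for this example. The following sketch is illustrative and not from the text. It assumes that, for $a = 0$, the function of (5.12) is $c = \frac{1}{r-2}\big[r + 2 - 2e^{-s/2} / \int_0^1 \lambda^{r/2} e^{-\lambda s/2}\,d\lambda\big]$ with $s = |x|^2$, and it evaluates the integral with a hand-written composite Simpson rule. The helper names, the grid of $s$ values, and the number of subintervals are assumptions.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def c_strawderman(s, r):
    """The shrinkage function c for a = 0, as a function of s = |x|^2."""
    integral = simpson(lambda lam: lam ** (r / 2) * math.exp(-lam * s / 2), 0.0, 1.0)
    return (r + 2 - 2 * math.exp(-s / 2) / integral) / (r - 2)

r = 6
grid = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 49.0]
values = [c_strawderman(s, r) for s in grid]
```

For $r = 6$ the computed values start at 0 when $|x| = 0$, increase with $|x|$, and approach the upper limit $(r + 2)/(r - 2) = 2$.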


Although neither the James-Stein estimator nor its positive-part version (5.9)
is admissible, it appears that no substantial improvements over the latter are
possible (see Example 7.3). We shall now turn to some generalizations of these
estimators, where we no longer require equal variances or equal weights in the
loss function.

We first look at the case where the covariance matrix is no longer $\sigma^2 I$, but may
be any positive definite matrix $\Sigma$. Conditions for minimaxity of the estimator

(5.13)  $\delta(x) = \left(1 - \frac{c(|x|^2)}{|x|^2}\right)x$

will now involve this covariance matrix.


Theorem 5.7 For $X \sim N_r(\theta, \Sigma)$ with $\Sigma$ known, an estimator of the form (5.13)
is minimax against the loss $L(\theta, \delta) = |\theta - \delta|^2$, provided

(i) $0 \le c(|x|^2) \le 2[\mathrm{tr}(\Sigma)/\lambda_{\max}(\Sigma)] - 4$,

(ii) the function $c(\cdot)$ is nondecreasing,

where $\mathrm{tr}(\Sigma)$ denotes the trace of the matrix $\Sigma$ and $\lambda_{\max}(\Sigma)$ denotes its largest
eigenvalue.

Note that the covariance matrix must satisfy $\mathrm{tr}(\Sigma)/\lambda_{\max}(\Sigma) > 2$ for $\delta$ to be
different from $X$. If $\Sigma = I$, $\mathrm{tr}(\Sigma)/\lambda_{\max}(\Sigma) = r$, so this is the dimensionality
restriction in another guise. Bock (1975) (see also Brown 1975) shows that $X$
is unique minimax among spherically symmetric estimators if $\mathrm{tr}(\Sigma)/\lambda_{\max}(\Sigma) <
2$. (An estimator is said to be spherically symmetric if it is equivariant under
orthogonal transformations. Such estimators were characterized by (4.11), to which
(5.13) is equivalent.)

When the bound on $c(\cdot)$ is displayed in terms of the covariance matrix, we get
some idea of the types of problems in which we can expect improvement from
shrinkage estimators. The condition $\mathrm{tr}(\Sigma) > 2\lambda_{\max}(\Sigma)$ will be satisfied when the
eigenvalues of $\Sigma$ are not too different (see Problem 5.10). If this condition is
not satisfied, then estimators which allow different coordinatewise shrinkage are
needed to obtain minimaxity (see Notes 9.6).
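The condition $\mathrm{tr}(\Sigma) > 2\lambda_{\max}(\Sigma)$ is simple to evaluate once the eigenvalues of $\Sigma$ are known. The sketch below is illustrative only: the function names and the two eigenvalue lists are assumptions, and the eigenvalues are supplied directly rather than computed from a matrix.

```python
def effective_dimension(eigenvalues):
    """tr(Sigma) / lambda_max(Sigma), computed from the eigenvalues of Sigma."""
    if min(eigenvalues) <= 0:
        raise ValueError("Sigma must be positive definite")
    return sum(eigenvalues) / max(eigenvalues)

def shrinkage_possible(eigenvalues):
    """True when tr(Sigma) > 2 * lambda_max(Sigma), the condition discussed above."""
    return effective_dimension(eigenvalues) > 2.0

balanced = [1.0, 1.0, 1.0, 1.0]     # Sigma = I with r = 4
dominated = [10.0, 1.0, 1.0, 1.0]   # one direction carries most of the variance
```

For `balanced` the ratio equals the dimension 4, while for `dominated` it is 1.3, so a single large eigenvalue can rule out a spherically symmetric improvement even when $r = 4$.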


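Proof of Theorem 5.7. The risk of $\delta(x)$ is

(5.14)  $R(\theta, \delta) = E_\theta[(\theta - \delta(X))'(\theta - \delta(X))]$
        $= E_\theta[(\theta - X)'(\theta - X)] - 2E_\theta\left[\frac{c(|X|^2)}{|X|^2}\,X'(X - \theta)\right] + E_\theta\left[\frac{c^2(|X|^2)}{|X|^2}\right]$

where $E_\theta(\theta - X)'(\theta - X) = \mathrm{tr}(\Sigma)$, the trace of the matrix $\Sigma$, is the minimax risk
(Problem 5.8). We can now apply integration by parts (see Problem 5.9) to write

(5.15)  $R(\theta, \delta) = \mathrm{tr}(\Sigma) + E_\theta\left[\frac{c(|X|^2)}{|X|^2}\left(c(|X|^2) + 4\,\frac{X'\Sigma X}{|X|^2} - 2\,\mathrm{tr}\,\Sigma\right)\right]
        - 4E_\theta\left[\frac{c'(|X|^2)}{|X|^2}\,X'\Sigma X\right]$.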

Since $c'(\cdot) \ge 0$, an upper bound on $R(\theta, \delta)$ results by dropping the last term. We
then note [see Equation (2.6.5)]

(5.16)  $\frac{x'\Sigma x}{|x|^2} = \frac{x'\Sigma x}{x'x} \le \lambda_{\max}(\Sigma)$

and the result follows. □

Theorem 5.5 can also be extended by considering more general loss functions 
in which the coordinates may have different weights. 

Example 5.8 Loss functions in the one-way layout. In the one-way layout we
observe

(5.17)  $X_{ij} \sim N(\xi_i, \sigma_i^2)$,  $j = 1, \ldots, n_i$,  $i = 1, \ldots, r$,

where the usual estimator of $\xi_i$ is $\bar{X}_i = \sum_j X_{ij}/n_i$. If the assumptions of Theorem
5.7 are satisfied, then the estimator $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_r)$ can be improved. If the $\xi_i$'s
represent mean responses for different treatments, for example, crop yields from
different fertilization treatments or mean responses from different drug therapies,
it may be unrealistic to penalize the estimation of each coordinate by the same
amount. In particular, if one drug is uncommonly expensive or if a fertilization
treatment is quite difficult to apply, this could be reflected in the loss function. ||

The situation described in Example 5.8 can be generalized to

(5.18)  $X \sim N(\theta, \Sigma)$,
        $L(\theta, \delta) = (\theta - \delta)'Q(\theta - \delta)$,

where both $\Sigma$ and $Q$ are positive definite matrices. We again ask under what
conditions the estimator (5.13) is a minimax estimator of $\theta$.

Before answering this question, we first note that, without loss of generality,
we can consider one of $\Sigma$ or $Q$ to be the identity (see Problem 5.11). Hence, we
take $\Sigma = I$ in the following theorem, whose proof is left to Problem 5.12; see also
Problem 5.13 for a more general result.

Theorem 5.9 Let $X \sim N(\theta, I)$. An estimator of the form (5.13) is minimax against
the loss $L(\theta, \delta) = (\theta - \delta)'Q(\theta - \delta)$, provided

(i) $0 \le c(|x|^2) \le 2[\mathrm{tr}(Q)/\lambda_{\max}(Q)] - 4$,

(ii) the function $c(\cdot)$ is nondecreasing.

Theorem 5.9 can also be viewed as a robustness result, since we have shown $\delta$
to be minimax against any $Q$, which provides an upper bound for $c(|x|^2)$ in (i).
This is in the same spirit as the results of Brown (1975), mentioned in Section 5.5.
(See Problem 5.14.)

Thus far, we have been mainly concerned with one form of shrinkage
estimator, the estimator (5.13). We shall now obtain a more general class of minimax
estimators by writing $\delta$ as $\delta(x) = x - g(x)$ and utilizing the resulting expression
(5.4) for the risk of $\delta$. As first noted by Stein, (5.4) can be combined with the
identities derived in Section 4.3 (for the Bayes estimator in an exponential family)
to obtain a set of sufficient conditions for minimaxity in terms of the condition of
superharmonicity of the marginal distribution (see Section 1.7).

In particular, for the case of $X \sim N_r(\theta, I)$, we can, by Corollary 3.3, write a
Bayes estimator of $\theta$ as

(5.19)  $E(\theta_i \mid x) = \frac{\partial}{\partial x_i}\log m(x) - \frac{\partial}{\partial x_i}\log h(x)$

with $-\partial \log h(x)/\partial x_i = x_i$ (Problem 5.17) so that the Bayes estimators are of the
form

(5.20)  $\delta(x) = x + \nabla \log m(x)$.
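As a quick sanity check on the representation $\delta(x) = x + \nabla \log m(x)$, consider the conjugate prior $\theta \sim N_r(0, \tau^2 I)$. This prior is an assumption introduced here purely for illustration; under it the marginal distribution of $X$ is $N_r(0, (1 + \tau^2)I)$.

```latex
m(x) \propto \exp\!\Big(-\frac{|x|^2}{2(1+\tau^2)}\Big), \qquad
\nabla \log m(x) = -\frac{x}{1+\tau^2}, \qquad
\delta(x) = x + \nabla\log m(x) = \frac{\tau^2}{1+\tau^2}\,x .
```

The result is the familiar linear Bayes estimator for the normal-normal model, so the gradient term acts as a data-dependent correction that pulls $x$ toward the prior mean.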


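Theorem 5.10 If $X \sim N_r(\theta, I)$, the risk, under squared error loss, of the estimator
(5.20) is given by

(5.21)  $R(\theta, \delta) = 1 + \frac{4}{r} E_\theta\left[\frac{\sum_i \frac{\partial^2}{\partial x_i^2}\sqrt{m(X)}}{\sqrt{m(X)}}\right]
        = 1 + \frac{4}{r} E_\theta\left[\frac{\nabla^2\sqrt{m(X)}}{\sqrt{m(X)}}\right]$

where $\nabla^2 f = \sum_i \{(\partial^2/\partial x_i^2)f\}$ is the Laplacian of $f$.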


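Proof. The $i$th component of the estimator (5.20) is

$\delta_i(x) = x_i + (\nabla \log m(x))_i = x_i + \frac{\partial}{\partial x_i}\log m(x) = x_i + \frac{m_i'(x)}{m(x)}$

where, for simplicity of notation, we write $m_i'(x) = (\partial/\partial x_i)m(x)$. In the risk identity
(5.4), set $g_i(x) = -m_i'(x)/m(x)$ to obtain

(5.22)  $R(\theta, \delta) = 1 + \frac{1}{r} E_\theta\sum_i\left[\frac{m_i'(X)}{m(X)}\right]^2 + \frac{2}{r} E_\theta\sum_i \frac{\partial}{\partial X_i}\left[\frac{m_i'(X)}{m(X)}\right]$
        $= 1 + \frac{2}{r} E_\theta\sum_i \frac{m_i''(X)}{m(X)} - \frac{1}{r} E_\theta\sum_i\left[\frac{m_i'(X)}{m(X)}\right]^2$

where $m_i''(x) = (\partial^2/\partial x_i^2)m(x)$, and the second expression follows from
straightforward differentiation and gathering of terms. Finally, notice the differentiation
identity

(5.23)  $\frac{\partial^2}{\partial x_i^2}[g(x)]^{1/2} = \frac{g_i''(x)}{2[g(x)]^{1/2}} - \frac{[g_i'(x)]^2}{4[g(x)]^{3/2}}$.

Using (5.23), we can rewrite the risk (5.22) in the form (5.21). □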


The form of the risk function (5.21) shows that the estimator (5.20) is minimax
(provided all expectations are finite) if $\sum_i (\partial^2/\partial x_i^2)[m(x)]^{1/2} \le 0$, and hence by
Theorem 1.7.24 if $[m(x)]^{1/2}$ is superharmonic. We, therefore, have established the
following class of minimax estimators.

Corollary 5.11 Under the conditions of Theorem 5.10, if $E_\theta\{\nabla^2\sqrt{m(X)}/\sqrt{m(X)}\}
< \infty$ and $[m(x)]^{1/2}$ is a superharmonic function, then $\delta(x) = x + \nabla \log m(x)$ is
minimax.

A useful consequence of Corollary 5.11 follows from the fact that
superharmonicity of $m(x)$ implies that of $[m(x)]^{1/2}$ (Problem 1.7.16) and is often easier to
verify.

Corollary 5.12 Under the conditions of Theorem 5.10, if $E_\theta|\nabla m(X)/m(X)|^2 <
\infty$, $E_\theta|\nabla^2 m(X)/m(X)| < \infty$, and $m(x)$ is a superharmonic function, then $\delta(x) =
x + \nabla \log m(x)$ is minimax.


Proof. From the second expression in (5.22), we see that

(5.24)  $R(\theta, \delta) \le 1 + \frac{2}{r} E_\theta\sum_i \frac{m_i''(X)}{m(X)}$,

which is $\le 1$ if $m(x)$ is superharmonic. □


Example 5.13 Superharmonicity of the marginal. For the model in (5.11) of
Example 5.6, we have

$m(x) \propto \int_0^1 \lambda^{(r/2) - a} e^{-\lambda|x|^2/2}\,d\lambda$

and

(5.25)  $\sum_i \frac{\partial^2}{\partial x_i^2} m(x) \propto \int_0^1 \lambda^{(r/2) - a + 1}\left[\lambda|x|^2 - r\right]e^{-\lambda|x|^2/2}\,d\lambda$
        $= \frac{1}{(|x|^2)^{(r/2) - a + 2}}\int_0^{|x|^2} t^{(r/2) - a + 1}[t - r]e^{-t/2}\,dt$.

Thus, a sufficient condition for $m(x)$ to be superharmonic is

(5.26)  $\int_0^{|x|^2} t^{(r/2) - a + 1}[t - r]e^{-t/2}\,dt \le 0$.

From Problem 5.18, we have

(5.27)  $\int_0^{|x|^2} t^{(r/2) - a + 1}[t - r]e^{-t/2}\,dt \le \int_0^{\infty} t^{(r/2) - a + 1}[t - r]e^{-t/2}\,dt$
        $= \Gamma\left(\frac{r}{2} - a + 2\right)2^{(r/2) - a + 2}[-2a + 4]$,

so $m(x)$ is superharmonic if $-2a + 4 \le 0$ or $a \ge 2$. For the choice $a = 2$, the
Strawderman estimator is (Problem 5.5)

$\delta(x) = \left(1 - \frac{r - 2}{|x|^2}\,\frac{P(\chi_r^2 \le |x|^2)}{P(\chi_{r-2}^2 \le |x|^2)}\right)x$,

which resembles the positive-part Stein estimator. Note that this estimator is not a
proper Bayes estimator, as the prior distribution is improper. ||
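The resemblance to the positive-part Stein estimator can be made concrete by comparing shrinkage factors. The sketch below is illustrative and not from the text. It assumes the $a = 2$ estimator multiplies $x$ by $1 - \frac{r-2}{|x|^2} P(\chi^2_r \le |x|^2)/P(\chi^2_{r-2} \le |x|^2)$, and it restricts attention to even $r$ so that the chi-squared distribution function has a closed form. The helper names and the choice $r = 6$ are assumptions.

```python
import math

def chi2_cdf_even(s, df):
    """P(chi^2_df <= s) for even df, via the closed-form finite Poisson sum."""
    if df % 2 or df <= 0:
        raise ValueError("this helper handles even, positive df only")
    half = s / 2.0
    term, tail = 1.0, 0.0
    for j in range(df // 2):
        if j > 0:
            term *= half / j      # term is now (s/2)^j / j!
        tail += term
    return 1.0 - math.exp(-half) * tail

def strawderman_factor(s, r):
    """Multiplier of x in the a = 2 estimator, as a function of s = |x|^2."""
    ratio = chi2_cdf_even(s, r) / chi2_cdf_even(s, r - 2)
    return 1.0 - (r - 2) / s * ratio

def james_stein_factor(s, r):
    """Multiplier of x in the ordinary James-Stein estimator."""
    return 1.0 - (r - 2) / s

r = 6
factors = {s: strawderman_factor(s, r) for s in (1.0, 4.0, 16.0, 100.0)}
```

For small $|x|^2$ the James-Stein factor is negative, whereas the factor computed here stays strictly between 0 and 1; for large $|x|^2$ the two factors are nearly identical.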


There are many other characterizations of superharmonic functions that lead
to different versions of minimax theorems. A most useful one, noticed by Stein
(1981), is the following.

Corollary 5.14 Under the conditions of Theorem 5.10 and Corollary 5.11, if the
prior $\pi(\theta)$ is superharmonic, then $\delta(x) = x + \nabla \log m(x)$ is minimax.

Proof. The marginal density can be written as

(5.28)  $m(x) = \int \phi_r(x - \theta)\pi(\theta)\,d\theta$,

where $\phi_r(x - \theta)$ is the $r$-variate normal density. From Problem 1.7.16, the
superharmonicity of $m(x)$ follows. □

Example 5.15 Superharmonic prior. The hierarchical Bayes estimators of Faith (1978) (Problem 5.7) are based on the multivariate t prior

(5.29) $\pi(\theta) \propto \left(\frac{2}{b} + |\theta|^2\right)^{-(a + r/2)}.$

It is straightforward to verify that this prior is superharmonic if $a \le -1$, allowing a simple verification of minimaxity of an estimator that can only be expressed as an integral.

The superharmonic condition, although sometimes difficult to verify, has often proved helpful not only in establishing minimaxity but also in understanding what types of prior distributions may lead to minimax Bayes estimators. See Note 9.7 for further discussion. ||

We close this section with an examination of componentwise risk. For $X_i \sim N(\theta_i, 1)$, independent, and risk function

(5.30) $R(\theta, \delta) = \frac{1}{r}\sum_{i=1}^r R(\theta_i, \delta_i)$

with $R(\theta_i, \delta_i) = E(\delta_i - \theta_i)^2$, it was seen in Section 5.3 that it is not possible to find a $\delta_i$ for which $R(\theta_i, \delta_i)$ is uniformly better than $R(\theta_i, X_i) = 1$.

Thus, the improvement in the average risk can be achieved only through increasing some of the component risks, and it becomes of interest to consider the maximum possible component risk

(5.31) $\max_i \sup_{\theta_i} R(\theta_i, \delta_i).$

For given $\lambda = \sum \theta_j^2$, it can be shown (Baranchik 1964) that (5.31) attains its maximum when all but one of the $\theta_j$'s are zero, say $\theta_2 = \cdots = \theta_r = 0$, $\theta_1 = \sqrt{\lambda}$, and that this maximum risk $\rho_r(\lambda)$ as a function of $\lambda$ increases from a minimum of $2/r$ at $\lambda = 0$ to a maximum and then decreases and tends to 1 as $\lambda \to \infty$; see Figure 5.2.

The values of $\max_\lambda \rho_r(\lambda)$ and the value $\lambda_0$ at which the maximum is attained, shown in Table 5.1, are given by Efron and Morris (1972a). The table suggests that shrinkage estimators will typically not be appropriate when the component problems concern different clients. No one wants his or her blood test subjected to the possibility of large errors in order to improve a laboratory's average performance.

Figure 5.2. Maximum component risk $\rho_r(\lambda)$ of the ordinary James-Stein estimator, and the componentwise risk of X, the UMVU estimator, for r = 4.

Table 5.1. Maximum Component Risk

r                    3      5      10     20     30     ∞
$\lambda_0$          2.49   2.85   3.62   4.80   5.75
$\rho_r(\lambda_0)$  1.24   1.71   2.93   5.40   7.89   r/4

To get a feeling for the behavior of the James-Stein estimator (4.21) with $\mu = 0$, in a situation in which most of the $\theta_i$'s are at or near zero (representing the standard or normal situation of no effect) but a few relatively large $\theta_i$'s are present, consider the 20-component model

$X_i \sim N(\theta_i, 1), \quad i = 1, \ldots, 20,$

where the vector $\theta = (\theta_1, \ldots, \theta_{20})$ is taken to have one of three configurations:


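(a) $\theta_1 = \cdots = \theta_{19} = 0$, $\theta_{20} = 2, 3, 4$,

(b) $\theta_1 = \cdots = \theta_{18} = 0$, $\theta_{19} = i$, $\theta_{20} = j$, $2 \le i \le j \le 4$,

(c) $\theta_1 = \cdots = \theta_{17} = 0$, $\theta_{18} = i$, $\theta_{19} = j$, $\theta_{20} = k$, $2 \le i \le j \le k \le 4$.
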
The resulting shrinkage factor, $1 - (r - 2)/|x|^2$, by which the observation is multiplied to obtain the estimators $\delta_i$ of $\theta_i$, has expected value

(5.32) $E\left(1 - \frac{r-2}{|X|^2}\right) = 1 - (r-2)\, E\left[\frac{1}{\chi^2_r(\lambda)}\right] = 1 - (r-2) \sum_{k=0}^{\infty} \frac{e^{-\lambda/2}(\lambda/2)^k}{(r + 2k - 2)\, k!},$

where $\lambda = |\theta|^2$ (see Problem 5.23). Its values are given in Table 5.2.

Table 5.2. Expected Value of the Shrinkage Factor

(a) $\theta_{20}$ and (b) $\theta_{19}\theta_{20}$

$\theta$'s ≠ 0   2     3     4     22    23    33    24    34    44
Factor           .17   .37   .46   .29   .40   .49   .51   .57   .63
$\lambda$        4     9     16    8     13    18    20    25    32

(c) $\theta_{18}\theta_{19}\theta_{20}$

$\theta$'s ≠ 0   222   223   224   233   234   244   333   334   344   444
Factor           .38   .47   .56   .54   .61   .66   .59   .64   .69   .78
$\lambda$        12    17    24    22    29    36    27    34    41    64


To see the effect of the shrinkage explicitly, suppose, for example, that the observation $X_{20}$ corresponding to $\theta_{20} = 2$ turned out to be 2.71. The modified estimate ranges from 2.71 × .17 = .46 (when $\theta_1 = \cdots = \theta_{19} = 0$) to 2.71 × .66 = 1.79 (when $\theta_1 = \cdots = \theta_{17} = 0$, $\theta_{18} = \theta_{19} = 4$).
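The series for the expected shrinkage factor, $1 - (r-2)\sum_k e^{-\lambda/2}(\lambda/2)^k / [(r+2k-2)\,k!]$, is easy to evaluate numerically. The sketch below is only an illustration under stated assumptions: the function name and the stopping rule are hypothetical choices, $r \ge 3$ is required, and the direct computation of $e^{-\lambda/2}$ limits it to moderate values of $\lambda$.

```python
import math


def expected_shrinkage_factor(r, lam, tol=1e-16, max_terms=100000):
    """Evaluate 1 - (r - 2) * sum_k w_k / (r + 2k - 2), where
    w_k = exp(-lam/2) (lam/2)^k / k! are Poisson(lam/2) probabilities."""
    weight = math.exp(-lam / 2.0)  # w_0
    total = 0.0
    k = 0
    while k < max_terms:
        total += weight / (r + 2 * k - 2)
        k += 1
        weight *= (lam / 2.0) / k  # w_k from w_{k-1}
        # Stop once past the Poisson mode and the weights are negligible.
        if k > lam / 2.0 and weight < tol:
            break
    return 1.0 - (r - 2) * total
```

For r = 20 and $\lambda = 4$ this returns approximately 0.17, and the factor increases toward 1 as $\lambda$ grows, which is the behavior described in the summary that follows.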

What is seen here can be summarized roughly as follows: 

(i) If all the $\theta_i$'s are at or fairly close to zero, then the James-Stein estimator will reduce the X's very substantially in absolute value and thereby typically will greatly improve the accuracy of the estimated values.

(ii) If there are some very large $\theta_i$'s or a substantial number of moderate ones, the factor by which the X's are multiplied will not be very far from 1, and the modification will not have a great effect.

Neither of these situations causes much of a problem: In (ii), the modification presents an unnecessary but not particularly harmful complication; in (i), it is clearly very beneficial. The danger arises in the following intermediate situation.

(iii) Most of the $\theta_i$'s are close to zero, but there are a few moderately large $\theta_i$'s (of the order of two to four standard deviations, say). These represent the cases in which something is going on, about which we will usually want to know. However, in these cases, the estimated values are heavily shrunk toward the norm, with the resulting risk of their being found "innocent by association."

If one is interested in minimizing the average risk (5.30) but is concerned about 
the possibility of large component risks, a compromise is possible along the lines 
of restricted Bayes estimation mentioned in Section 5.2. One can impose an upper
bound on the maximum component risk, say 10% or 25% above the minimax
risk of 1 (when $\sigma = 1$). Subject to this restriction, one can then try to minimize
the average risk, for example, in the sense of obtaining a Bayes or empirical 
Bayes solution. An approximation to such an approach has been developed by 
Efron and Morris (1971, 1972a), Berger (1982a, 1988b), Bickel (1983, 1984), and 
Kempthorne (1988a, 1988b). See Example 6.7 for an illustration. 

The results discussed in this and the preceding section for the simultaneous 
estimation of normal means have been extended, particularly to various members 
of the exponential family and to general location parameter families, with and 
without nuisance parameters. The next section contains a number of illustrations. 


6 Extensions 

The estimators of the previous section have all been constructed for the case of the
estimation of $\theta$ based on observing $X \sim N_r(\theta, I)$. The applicability of shrinkage
estimation, now often referred to as Stein estimation, goes far beyond this case. In 
this section, through examples, we will try to illustrate some of the wide ranging 
applicability of the “Stein effect,” that is, the ability to improve individual estimates 
by using ensemble information. 

Also, the shrinkage estimators previously considered were designed to obtain 
the greatest risk improvement in a specified region of the parameter space. For 
example, the maximum risk improvement of (5.10) occurs at $\theta_1 = \theta_2 = \cdots = \theta_r = 0$,
while that of (4.21) occurs at $\theta_1 = \mu_1, \theta_2 = \mu_2, \ldots, \theta_r = \mu_r$. In the next
three examples, we look at modifications of Stein estimators that shrink toward 
adaptively chosen targets, that is, targets selected by the data. By doing so, it is 
hoped that a maximal risk improvement will be realized. 

Although we only touch upon the topic of selecting a shrinkage target, the 
literature is vast. See Note 9.7 for some references. 

The first two examples examine estimators that we have seen before, in the 
context of empirical Bayes analysis of variance and regression (Examples 4.7.7 
and 4.7.8). These estimators shrink toward subspaces of the parameter space rather 
than specified points. Moreover, we can allow the data to help choose the specific 
shrinkage target. We now establish minimaxity of such estimators. 

Example 6.1 Shrinking toward a common mean. In problems where it is thought 
there is some similarity between components, a reasonable choice of shrinkage 
target may be the linear subspace where all the components are equal. This was 
illustrated in Example 4.7.7, where the estimator (4.7.28) shrunk the coordinates 
toward an estimated common mean value rather than a specified point. 

For the average squared error loss $L(\theta, \delta) = \frac{1}{r}|\theta - \delta|^2$, the estimator (4.7.28) with coordinates

(6.1) $\delta_i^L(x) = \bar{x} + \left(1 - \frac{c(r-3)}{\sum_j (x_j - \bar{x})^2}\right)(x_i - \bar{x})$

has risk given by

(6.2) $R(\theta, \delta^L) = 1 + \frac{(r-3)^2}{r}\, E_\theta\left[\frac{c(c-2)}{\sum_j (X_j - \bar{X})^2}\right].$

Hence, $\delta^L$ is minimax if $r \ge 4$ and $c \le 2$. The minimum risk is attained at $\theta$ values that satisfy $\sum(\theta_i - \bar{\theta})^2 = 0$, that is, where $\theta_1 = \theta_2 = \cdots = \theta_r$. Moreover, the best value of c is c = 1, which results in a minimum risk of 3/r. This is greater than the minimum of 2/r (for the case of a known value of $\theta$) but is attained on a larger set. See Problems 6.1 and 6.2. ||
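A minimal Python sketch of an estimator that shrinks each coordinate toward the sample mean by the factor $1 - c(r-3)/\sum_j (x_j - \bar{x})^2$ is given below. The function name is hypothetical, and returning the common mean when all observations coincide is a convention adopted here to avoid division by zero, not something stated in the text.

```python
def shrink_to_common_mean(x, c=1.0):
    """Shrink the coordinates of x toward their mean by the factor
    1 - c (r - 3) / sum_j (x_j - xbar)^2; requires r >= 4."""
    r = len(x)
    if r < 4:
        raise ValueError("this sketch assumes r >= 4")
    xbar = sum(x) / r
    ss = sum((v - xbar) ** 2 for v in x)
    if ss == 0.0:
        # Assumed convention: all coordinates already equal the target.
        return [xbar] * r
    factor = 1.0 - c * (r - 3) / ss
    return [xbar + factor * (v - xbar) for v in x]
```

With x = (1, 2, 3, 4, 5, 6) and c = 1, the mean is 3.5, the sum of squares is 17.5, and the factor is 1 - 3/17.5, so the first coordinate moves from 1 to about 1.43 while the average of the estimates stays at 3.5.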


Example 6.2 Shrinking toward a linear subspace. The estimator $\delta^L$ given by (6.1) shrinks toward the subspace of the parameter space defined by

(6.3) $\mathcal{L} = \{\theta : \theta_1 = \theta_2 = \cdots = \theta_r\} = \left\{\theta : \frac{1}{r} J\theta = \theta\right\},$

where J is a matrix of 1's, $J = 11'$.

Another useful submodel, which is a generalization of (6.3), is

(6.4) $\theta_i = \alpha + \beta t_i,$

where the $t_i$'s are known but $\alpha$ and $\beta$ are unknown. This corresponds to the (sub)model of a linear trend in the means (see Example 4.7.8). If we define

(6.5) $T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_r \end{pmatrix}',$

then the $\theta_i$'s satisfying (6.4) constitute the subspace

(6.6) $\mathcal{L} = \{\theta : T^*\theta = \theta\},$

where $T^* = T(T'T)^{-1}T'$ is the matrix that projects any vector $\theta$ into the subspace. (Such projection matrices are symmetric and idempotent, that is, they satisfy $(T^*)^2 = T^*$.)

The models (6.3) and (6.6) suggest what the more general situation would look like when the target is a linear subspace defined by

(6.7) $\mathcal{L}_K = \{\theta : K\theta = \theta,\ K \text{ idempotent of rank } s\}.$

If we shrink toward the MLE of $\theta \in \mathcal{L}_K$, which is given by $\hat{\theta}_K = Kx$, the resulting Stein estimator is

(6.8) $\delta^K(x) = \hat{\theta}_K + \left(1 - \frac{r - s - 2}{|x - \hat{\theta}_K|^2}\right)(x - \hat{\theta}_K)$

and is minimax provided $r - s > 2$. (See Problem 6.3.) More general linear restrictions are possible: one can take $\mathcal{L} = \{\theta : H\theta = m\}$ where $H_{s \times r}$ and $m_{s \times 1}$ are specified (see Casella and Hwang 1987). ||
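For the linear-trend submodel, the projection of x onto the subspace is the least-squares line fitted to the pairs $(t_i, x_i)$, so the subspace has dimension s = 2. The sketch below, with a hypothetical function name and plain Python lists, shrinks x toward that fitted line using the factor $1 - (r - s - 2)/|x - \hat\theta|^2$; the handling of a zero residual is an assumed convention.

```python
def shrink_to_linear_trend(x, t):
    """Shrink x toward its least-squares line a + b t (a subspace of rank s = 2)."""
    r, s = len(x), 2
    if len(t) != r:
        raise ValueError("x and t must have the same length")
    if r - s <= 2:
        raise ValueError("this sketch assumes r - s > 2")
    tbar = sum(t) / r
    xbar = sum(x) / r
    stt = sum((ti - tbar) ** 2 for ti in t)
    if stt == 0.0:
        raise ValueError("the t_i must not all be equal")
    b = sum((ti - tbar) * (xi - xbar) for ti, xi in zip(t, x)) / stt
    a = xbar - b * tbar
    fit = [a + b * ti for ti in t]            # projection of x onto the subspace
    resid = [xi - fi for xi, fi in zip(x, fit)]
    rss = sum(e * e for e in resid)
    if rss == 0.0:
        return fit                             # x already lies in the subspace
    factor = 1.0 - (r - s - 2) / rss
    return [fi + factor * e for fi, e in zip(fit, resid)]
```

Because the residual is orthogonal to the subspace, the fitted line of the shrunken vector is the same as that of x; only the departure from the line is reduced.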


Example 6.3 Combining biased and unbiased estimators. Green and Strawder- 
man (1991) show how the Stein effect can be used to combine biased and unbiased 
estimators, and they apply their results to a problem in forestry. 
An important attribute of a forest stand (a homogeneous group of trees) is the
basal area per acre, B, defined as the sum of the cross-sectional areas 4.5 feet above
the ground of all trees. Regression models exist that predict log B as a function of 
stand age, number of trees per acre, and the average height of the dominant trees 
in the stand. The average prediction, Y, from the regression is a biased estimator 
of B. Green and Strawderman investigated how to combine this estimator with X, 
the sample mean basal area from a small sample of trees, to obtain an improved 
estimator of B. They formulated the problem in the following way. 

Suppose $X \sim N_r(\theta, \sigma^2 I)$ and $Y \sim N_r(\theta + \xi, \tau^2 I)$, independent, where $\sigma^2$ and $\tau^2$ are known, and the loss function is $L(\theta, \delta) = |\theta - \delta|^2/(r\sigma^2)$. Thus, $\xi$ is an unknown nuisance parameter. The estimator

(6.9) $\delta^c(x, y) = y + \left(1 - \frac{c(r-2)\sigma^2}{|x - y|^2}\right)(x - y)$

is a minimax estimator of $\theta$ if $0 \le c \le 2$, which follows from noting that

(6.10) $R(\theta, \delta^c) = 1 - \sigma^2\, \frac{(r-2)^2}{r}\, E\left[\frac{c(2 - c)}{|X - Y|^2}\right]$

and that the minimax risk is 1. If $\xi = 0$ and, hence, Y is also an unbiased estimator of $\theta$, then the optimal linear combined estimator

(6.11) $\delta^{\text{comb}}(x, y) = \frac{\tau^2 x + \sigma^2 y}{\sigma^2 + \tau^2}$

dominates $\delta^1(x, y)$ in risk. However, the risk of $\delta^{\text{comb}}$ becomes unbounded as $|\xi| \to \infty$, whereas that of $\delta^1$ is bounded by 1. (See Problems 6.5 and 6.6.) ||
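The following Python sketch illustrates an estimator that starts from the possibly biased vector y and moves toward the unbiased vector x by the factor $1 - c(r-2)\sigma^2/|x - y|^2$. The function name and the treatment of the case x = y are assumptions made for the illustration.

```python
def combine_biased_unbiased(x, y, sigma2, c=1.0):
    """Return y + (1 - c (r - 2) sigma2 / |x - y|^2) (x - y).

    x: unbiased estimate with known variance sigma2 per coordinate.
    y: possibly biased estimate of the same vector."""
    r = len(x)
    if len(y) != r:
        raise ValueError("x and y must have the same length")
    diff = [xi - yi for xi, yi in zip(x, y)]
    d2 = sum(d * d for d in diff)
    if d2 == 0.0:
        return list(y)  # assumed convention when the two estimates agree
    factor = 1.0 - c * (r - 2) * sigma2 / d2
    return [yi + factor * di for yi, di in zip(y, diff)]
```

When x and y are far apart relative to $\sigma^2$, the factor is close to 1 and the result is close to x, which is why a badly biased y cannot make the risk unbounded.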


The next example looks at the important case of unknown variance. 

Example 6.4 Unknown variance. The James-Stein estimator (4.21), which shrinks X toward a given point $\mu$, was obtained under the assumption that $X \sim N(\theta, I)$. We shall now generalize this estimator to the situations, first, that the common variance of the $X_i$ has a known value $\sigma^2$ and then that $\sigma^2$ is unknown.

In the first case, the problem can be reduced to that with unit variance by considering the variables $X_i/\sigma$ and estimating the means $\theta_i/\sigma$ and then multiplying the estimator of $\theta_i/\sigma$ by $\sigma$ to obtain an estimator of $\theta_i$. This argument leads to replacing (4.21) by

(6.12) $\delta_i = \mu_i + \left(1 - \frac{r-2}{|x - \mu|^2/\sigma^2}\right)(x_i - \mu_i),$

where $|x - \mu|^2 = \sum (x_i - \mu_i)^2$, with risk function (see Problem 6.7)

(6.13) $R(\theta, \delta) = \sigma^2\left[1 - \frac{r-2}{r}\, E_\theta\left(\frac{r-2}{|X - \mu|^2/\sigma^2}\right)\right].$

Suppose now that $\sigma^2$ is unknown. We shall then suppose that there exists a random variable $S^2$, independent of X and such that $S^2/\sigma^2 \sim \chi^2_\nu$, and we shall in (6.12) replace $\sigma^2$ by $\hat{\sigma}^2 = kS^2$, where k is a positive constant. The estimator is then modified to

(6.14) $\delta_i = \mu_i + \left(1 - \frac{r-2}{|x - \mu|^2/\hat{\sigma}^2}\right)(x_i - \mu_i) = \mu_i + \left(1 - \frac{r-2}{|x - \mu|^2/\sigma^2}\, \frac{\hat{\sigma}^2}{\sigma^2}\right)(x_i - \mu_i).$


The conditional risk of $\delta$ given $\hat{\sigma}$ is given by (5.4) with $|x - \mu|^2$ replaced by $|x - \mu|^2/\sigma^2$ and $c = \hat{\sigma}^2/\sigma^2$. Because of the independence of $S^2$ and $|x - \mu|^2$, we thus have

(6.15) $R(\theta, \delta) = 1 - \frac{(r-2)^2}{r}\, E_\theta\left[\frac{\sigma^2}{|X - \mu|^2}\right] E\left[2k\,\frac{S^2}{\sigma^2} - k^2\left(\frac{S^2}{\sigma^2}\right)^2\right].$

Now, $E(S^2/\sigma^2) = \nu$ and $E(S^2/\sigma^2)^2 = \nu(\nu + 2)$, so that the second expectation is

(6.16) $E\left[2k\,\frac{S^2}{\sigma^2} - k^2\left(\frac{S^2}{\sigma^2}\right)^2\right] = 2k\nu - k^2\nu(\nu + 2).$

This is positive (making the estimator minimax) if $k < 2/(\nu + 2)$, and (6.16) is maximized at $k = 1/(\nu + 2)$, where its value is $\nu/(\nu + 2)$.

The best choice of k in $\hat{\sigma}^2$ thus leads to using the estimator

(6.17) $\hat{\sigma}^2 = S^2/(\nu + 2)$

and the risk of the resulting estimator is

(6.18) $R(\theta, \delta) = 1 - \frac{\nu}{\nu + 2}\, \frac{(r-2)^2}{r}\, E_\theta\left[\frac{\sigma^2}{|X - \mu|^2}\right].$

The improvement in risk over X is thus reduced from that of (4.25) by a factor of $\nu/(\nu + 2)$. (See Problems 6.7-6.11.) ||
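As an illustration of the unknown-variance case, the sketch below plugs the variance estimate $S^2/(\nu + 2)$ into the James-Stein form, and also evaluates the quantity $2k\nu - k^2\nu(\nu+2)$ so that the choice $k = 1/(\nu + 2)$ can be checked numerically. Function names are hypothetical, and returning $\mu$ when $x = \mu$ is an assumed convention.

```python
def risk_gain(k, nu):
    """The expectation 2 k nu - k^2 nu (nu + 2) as a function of k."""
    return 2.0 * k * nu - k * k * nu * (nu + 2.0)


def james_stein_unknown_variance(x, mu, s2, nu):
    """James-Stein estimator shrinking x toward mu, with sigma^2 replaced
    by s2 / (nu + 2), where s2 / sigma^2 is chi-squared with nu d.f."""
    r = len(x)
    if len(mu) != r:
        raise ValueError("x and mu must have the same length")
    sigma2_hat = s2 / (nu + 2.0)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    d2 = sum(d * d for d in diff)
    if d2 == 0.0:
        return list(mu)  # assumed convention when x coincides with the target
    factor = 1.0 - (r - 2) * sigma2_hat / d2
    return [mi + factor * di for mi, di in zip(mu, diff)]
```

For $\nu = 8$, `risk_gain(0.1, 8)` equals 0.8, which is $\nu/(\nu + 2)$, and nearby values of k give a smaller gain.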


For distributions other than the normal, Strawderman (1974) determined minimax Stein estimators for the following situation.

Example 6.5 Mixture of normals. Suppose that, given $\sigma$, the vector X is distributed as $N(\theta, \sigma^2 I)$, and that $\sigma$ is a random variable with distribution G, so that the density of X is

(6.19) $f(|x - \theta|) = \frac{1}{(2\pi)^{r/2}} \int_0^\infty e^{-|x - \theta|^2/(2\sigma^2)}\, \sigma^{-r}\, dG(\sigma),$

a scale mixture of normals, including, in particular, the multivariate Student's t-distribution. Since $E(|X - \theta|^2 \mid \sigma) = r\sigma^2$, it follows that with loss function $L(\theta, \delta) = |\theta - \delta|^2/r$, the risk of the estimator X is $E(\sigma^2)$. On the other hand, the risk of the estimator

(6.20) $\delta(x) = \left(1 - \frac{c}{|x|^2}\right) x$

is given by

(6.21) $E_\theta L[\theta, \delta(X)] = E_\sigma E_{\theta|\sigma} L[\theta, \delta(X)].$

Calculations similar to those in the proofs of Theorems 4.7.2 and 5.1 show that

$E_{\theta|\sigma} L[\theta, \delta(X)] = \sigma^2 - \frac{1}{r}\left[2c(r-2) - \frac{c^2}{\sigma^2}\right] E_{\theta|\sigma}\left(\frac{\sigma^2}{|X|^2}\right)$

and, hence,

(6.22) $R(\theta, \delta) = E_\sigma \sigma^2 - \frac{1}{r}\, E_\sigma\left\{\left[2c(r-2) - \frac{c^2}{\sigma^2}\right] E_{\theta|\sigma}\left(\frac{\sigma^2}{|X|^2}\right)\right\}.$

An upper bound on the risk can be obtained from the following lemma, whose 
proof is left to Problem 6.13. 


Lemma 6.6 Let Y be a random variable, and g(y) and h(y) any functions for which $E[g(Y)]$, $E[h(Y)]$, and $E[g(Y)h(Y)]$ exist. Then:

(a) If one of the functions g(·) and h(·) is nonincreasing and the other is nondecreasing,

$E[g(Y)h(Y)] \le E[g(Y)]\,E[h(Y)].$

(b) If both functions are either nondecreasing or nonincreasing,

$E[g(Y)h(Y)] \ge E[g(Y)]\,E[h(Y)].$

Returning to the risk function (6.22), we see that $[2c(r-2) - c^2/\sigma^2]$ is an increasing function of $\sigma^2$, and $E_{\theta|\sigma}(\sigma^2/|X|^2)$ is also an increasing function of $\sigma^2$. (This latter statement follows from the fact that, given $\sigma^2$, $|X|^2/\sigma^2$ has a noncentral $\chi^2$-distribution with noncentrality parameter $|\theta|^2/\sigma^2$, and that, therefore, as was pointed out following (4.25), the expectation is increasing in $\sigma^2$.)

Therefore, by Lemma 6.6,

$E_\sigma\left\{\left[2c(r-2) - \frac{c^2}{\sigma^2}\right] E_{\theta|\sigma}\left(\frac{\sigma^2}{|X|^2}\right)\right\} \ge E_\sigma\left[2c(r-2) - \frac{c^2}{\sigma^2}\right] E_\sigma\left[E_{\theta|\sigma}\left(\frac{\sigma^2}{|X|^2}\right)\right].$

Hence, $\delta(x)$ will dominate x if

$2c(r-2) - c^2 E_\sigma\frac{1}{\sigma^2} > 0$

or

$0 < c < 2(r-2)\Big/E_\sigma\frac{1}{\sigma^2} = 2/E_0|X|^{-2},$

where $E_0|X|^{-2}$ is the expectation when $\theta = 0$ (see Problem 6.12).

If $f(|x - \theta|)$ is the normal density $N(\theta, I)$, then $E_0|X|^{-2} = (r-2)^{-1}$, and we are back to a familiar condition. The interesting fact is that, for a wide class of scale mixtures of normals, $E_0|X|^{-2} > (r-2)^{-1}$. This holds, for example, if $1/\sigma^2 \sim \chi^2_\nu/\nu$ so $f(|x - \theta|)$ is multivariate Student's t. This implies a type of robustness of the estimator (6.20); that is, for $0 \le c \le 2(r-2)$, $\delta(X)$ dominates X under a multivariate t-distribution and hence retains its minimax property (see Problem 6.12). ||


Bayes estimators minimize the average risk under the prior, but the maximum 
of their risk functions can be large and even infinite. On the other hand, minimax 
estimators often have relatively large Bayes risk under many priors. The following 
example, due to Berger (1982a), shows how it is sometimes possible to construct 
estimators having good Bayes risk properties (with respect to a given prior), while 
at the same time being minimax. The resulting estimator is a compromise between 
a Bayes estimator and a Stein estimator. 

Example 6.7 Bayesian robustness. For $X \sim N_r(\theta, \sigma^2 I)$ and $\theta \sim \pi = N_r(0, \tau^2 I)$, the Bayes estimator against squared error loss is

(6.23) $\delta^\pi(x) = \frac{\tau^2}{\sigma^2 + \tau^2}\, x$

with Bayes risk $r(\pi, \delta^\pi) = r\sigma^2\tau^2/(\sigma^2 + \tau^2)$. However, $\delta^\pi$ is not minimax and, in fact, has unbounded risk (Problem 4.3.12). The Stein estimator

(6.24) $\delta^c(x) = \left(1 - c\,\frac{(r-2)\sigma^2}{|x|^2}\right) x$

is minimax if $0 \le c \le 2$, but its Bayes risk

$r(\pi, \delta^c) = r(\pi, \delta^\pi) + \frac{\sigma^4}{\sigma^2 + \tau^2}\,[r + c(c-2)(r-2)],$

at the best value c = 1, is $r(\pi, \delta^1) = r(\pi, \delta^\pi) + 2\sigma^4/(\sigma^2 + \tau^2)$.

To construct a minimax estimator with small Bayes risk, consider the compromise estimator

(6.25) $\delta^R(x) = \begin{cases} \delta^\pi(x) & \text{if } |x|^2 < c(r-2)(\sigma^2 + \tau^2) \\ \delta^c(x) & \text{if } |x|^2 \ge c(r-2)(\sigma^2 + \tau^2). \end{cases}$

This estimator is minimax if $0 \le c \le 2$ (Problem 6.14). If $|x|^2 > c(r-2)(\sigma^2 + \tau^2)$, the data do not support the prior specification and we, therefore, put $\delta^R = \delta^c$; if $|x|^2 < c(r-2)(\sigma^2 + \tau^2)$, we tend to believe that the data support the prior specification since $|X|^2/(\sigma^2 + \tau^2) \sim \chi^2_r$, and we are, therefore, willing to gamble on $\pi$ and put $\delta^R = \delta^\pi$.

The Bayes risk of $\delta^R$ is

(6.26) $r(\pi, \delta^R) = E|\theta - \delta^\pi(X)|^2\, I\left(|X|^2 < c(r-2)(\sigma^2 + \tau^2)\right) + E|\theta - \delta^c(X)|^2\, I\left(|X|^2 \ge c(r-2)(\sigma^2 + \tau^2)\right),$

where the expectation is over the joint distribution of X and $\theta$. Adding $\pm\delta^\pi$ to the second term in (6.26) yields

(6.27) $r(\pi, \delta^R) = E|\theta - \delta^\pi(X)|^2 + E|\delta^\pi(X) - \delta^c(X)|^2\, I\left(|X|^2 \ge c(r-2)(\sigma^2 + \tau^2)\right) = r(\pi, \delta^\pi) + E|\delta^\pi(X) - \delta^c(X)|^2\, I\left(|X|^2 \ge c(r-2)(\sigma^2 + \tau^2)\right).$

Here, we have used the fact that, marginally, $|X|^2/(\sigma^2 + \tau^2) \sim \chi^2_r$. We can write (6.27) as (see Problem 6.14)

(6.28) $r(\pi, \delta^R) = r(\pi, \delta^\pi) + \frac{1}{r-2}\, \frac{\sigma^4}{\sigma^2 + \tau^2}\, E[Y - c(r-2)]^2\, I[Y > c(r-2)],$

where $Y \sim \chi^2_{r-2}$. An upper bound on (6.28) is obtained by dropping the indicator function, which gives

(6.29) $r(\pi, \delta^R) \le r(\pi, \delta^\pi) + \frac{1}{r-2}\, \frac{\sigma^4}{\sigma^2 + \tau^2}\, E[Y - c(r-2)]^2 = r(\pi, \delta^\pi) + \frac{\sigma^4}{\sigma^2 + \tau^2}\,[r + c(c-2)(r-2)] = r(\pi, \delta^c).$

This shows that $\delta^R$ has smaller Bayes risk than $\delta^c$ while remaining minimax.

Since $E(Y - a)^2 I(Y > a)$ is a decreasing function of a (Problem 6.14), the value c = 2 minimizes (6.28) and therefore, among the estimators (6.25), determines the minimax estimator with minimum Bayes risk. However, for c = 2, $\delta^c$ has the same (constant) risk as X, so we are trading optimal Bayes risk for minimal frequentist risk improvement over X, the constant risk minimax estimator. Thus, it may be better to choose c = 1, which gives optimal frequentist risk performance and still provides good Bayes risk reduction over $\delta^1$. Table 6.1 shows the relative Bayes savings

$r^* = \frac{r(\pi, \delta^c) - r(\pi, \delta^R)}{r(\pi, \delta^c)}$


for c = 1. 


Table 6.1. Values of r*, the Relative Bayes Risk Savings of $\delta^R$ over $\delta^c$, with c = 1

r     3      4      5      7      10     20
r*    .801   .736   .699   .660   .629   .587

For other approaches to this “compromise” decision problem, see Bickel (1983, 
1984), Kempthorne (1988a, 1988b), and DasGupta and Rubin (1988). || 
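The compromise rule of this example switches between the linear Bayes estimator and the Stein estimator according to whether $|x|^2$ falls below $c(r-2)(\sigma^2 + \tau^2)$. The following Python sketch shows that switching logic; the function names are illustrative, $r \ge 3$ is assumed, and the guard for x = 0 in the Stein branch is an added convention.

```python
def bayes_estimator(x, sigma2, tau2):
    """Linear Bayes estimator tau2 / (sigma2 + tau2) * x."""
    w = tau2 / (sigma2 + tau2)
    return [w * v for v in x]


def stein_estimator(x, sigma2, c=1.0):
    """Stein estimator (1 - c (r - 2) sigma2 / |x|^2) x."""
    r = len(x)
    n2 = sum(v * v for v in x)
    if n2 == 0.0:
        return list(x)  # assumed convention at the origin
    factor = 1.0 - c * (r - 2) * sigma2 / n2
    return [factor * v for v in x]


def compromise_estimator(x, sigma2, tau2, c=1.0):
    """Use the Bayes estimator when |x|^2 is below the threshold
    c (r - 2) (sigma2 + tau2), and the Stein estimator otherwise."""
    r = len(x)
    n2 = sum(v * v for v in x)
    threshold = c * (r - 2) * (sigma2 + tau2)
    if n2 < threshold:
        return bayes_estimator(x, sigma2, tau2)
    return stein_estimator(x, sigma2, c)
```

At the threshold the two branches give the same value, since $1 - c(r-2)\sigma^2/|x|^2 = \tau^2/(\sigma^2 + \tau^2)$ when $|x|^2 = c(r-2)(\sigma^2 + \tau^2)$, so the compromise estimator is continuous in x.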

Thus far, we have considered only continuous distributions, but the Stein effect 
continues to hold also in discrete families. Minimax proofs in discrete families 
have developed along two different lines. The first method, due to Clevenson and 
Zidek (1975), is illustrated by the following result. 

Theorem 6.8 Let $X_i \sim \text{Poisson}(\lambda_i)$, i = 1, ..., r, $r \ge 2$, be independent, and let the loss be given by

(6.30) $L(\lambda, \delta) = \sum_{i=1}^r (\lambda_i - \delta_i)^2/\lambda_i.$

The estimator

(6.31) $\delta_{cz}(x) = \left(1 - \frac{c(\sum x_i)}{\sum x_i + b}\right) x$

is minimax if

(i) c(·) is nondecreasing,

(ii) $0 \le c(\cdot) \le 2(r - 1)$,

(iii) $b \ge r - 1$.

Recall (Corollary 2.20) that the usual minimax estimator here is X, with constant risk r. Note also that, in contrast to the normal-squared-error-loss case, by (ii) there exist positive values of c for which $\delta_{cz}$ is minimax provided $r \ge 2$.

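Proof. If $Z = \sum X_i$, the risk of $\delta_{cz}$ can be written as

(6.32) $R(\lambda, \delta_{cz}) = E\left[\sum_{i=1}^r \frac{1}{\lambda_i}\left(\lambda_i - X_i + \frac{c(Z)X_i}{Z + b}\right)^2\right] = r + 2E\left[\frac{c(Z)}{Z + b}\sum_{i=1}^r \frac{X_i(\lambda_i - X_i)}{\lambda_i}\right] + E\left[\frac{c^2(Z)}{(Z + b)^2}\sum_{i=1}^r \frac{X_i^2}{\lambda_i}\right].$

Let us first evaluate the expectations conditional on Z. The distribution of $X_i \mid Z$ is multinomial with $E(X_i \mid Z) = Z(\lambda_i/\Lambda)$ and $\text{var}(X_i \mid Z) = Z(\lambda_i/\Lambda)(1 - \lambda_i/\Lambda)$, where $\Lambda = \sum \lambda_i$. Hence,

(6.33) $E\left[\sum_{i=1}^r \frac{X_i(\lambda_i - X_i)}{\lambda_i} \,\Big|\, Z\right] = \frac{Z}{\Lambda}\,[\Lambda - (Z + r - 1)], \qquad E\left[\sum_{i=1}^r \frac{X_i^2}{\lambda_i} \,\Big|\, Z\right] = \frac{Z}{\Lambda}\,(Z + r - 1),$

and, so, after some rearrangement of terms,

(6.34) $R(\lambda, \delta_{cz}) = r + E\left\{\frac{c(Z)Z}{\Lambda(Z + b)}\left[2(\Lambda - Z) - 2(r - 1) + c(Z)\,\frac{Z + r - 1}{Z + b}\right]\right\}.$

Now, if $b \ge r - 1$, $z + r - 1 \le z + b$, and $c(z) \le 2(r - 1)$, we have

$-2(r - 1) + c(z)\,\frac{z + r - 1}{z + b} \le -2(r - 1) + c(z) \le 0,$

so the risk of $\delta_{cz}$ is bounded above by

$R(\lambda, \delta_{cz}) \le r + 2E\left[\frac{c(Z)Z}{\Lambda(Z + b)}\,(\Lambda - Z)\right].$

But this last expectation is the product of an increasing and a decreasing function of z; hence, by Lemma 6.6,

(6.35) $E\left[\frac{c(Z)Z}{\Lambda(Z + b)}\,(\Lambda - Z)\right] \le E\left[\frac{c(Z)Z}{\Lambda(Z + b)}\right] E(\Lambda - Z) = 0,$

since $Z \sim \text{Poisson}(\Lambda)$. Hence, $R(\lambda, \delta_{cz}) \le r$ and $\delta_{cz}$ is minimax. □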

If we recall Example 4.6.6 [in particular, Equation (4.6.29)], we see a similarity between $\delta_{cz}$ and the hierarchical Bayes estimators derived there. It is interesting to note that $\delta_{cz}$ is also a Bayes estimator (Clevenson and Zidek 1975; see Problem 6.15).
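A small Python sketch of an estimator of the Clevenson-Zidek form $(1 - c(z)/(z + b))\,x$ with $z = \sum x_i$ follows. The function name and the default choices $c \equiv r - 1$ and $b = r - 1$ are assumptions made for the illustration; they are one choice satisfying conditions (i)-(iii), and the checks in the code simply enforce (ii) and (iii) at the observed z.

```python
def clevenson_zidek(x, c=None, b=None):
    """Shrink a vector of Poisson counts toward zero by 1 - c(z) / (z + b),
    where z is the total count.

    c: function of z, assumed nondecreasing with 0 <= c(z) <= 2 (r - 1).
    b: constant with b >= r - 1."""
    r = len(x)
    if r < 2:
        raise ValueError("this sketch assumes r >= 2")
    if b is None:
        b = r - 1
    if c is None:
        c = lambda z: r - 1  # constant, hence nondecreasing
    if b < r - 1:
        raise ValueError("condition (iii) requires b >= r - 1")
    z = sum(x)
    cz = c(z)
    if not 0 <= cz <= 2 * (r - 1):
        raise ValueError("condition (ii) requires 0 <= c(z) <= 2 (r - 1)")
    factor = 1.0 - cz / (z + b)
    return [factor * v for v in x]
```

For x = (1, 2, 3) the defaults give z = 6, c = 2, b = 2 and a factor of 0.75; when all counts are zero the estimate is the zero vector.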

The above method of proof, which relies on being able to evaluate the conditional distribution of $X_i \mid \sum X_i$ and the marginal distribution of $\sum X_i$, works for other discrete families, in particular the negative binomial and the binomial (where n is the parameter to be estimated). (See Problem 6.16.) However, there exists a more powerful method (similar to that of Stein's lemma) which is based on the following lemma due to Hudson (1978) and Hwang (1982a). The proof is left to Problem 6.17.


Lemma 6.9 Let $X_i$, i = 1, ..., r, be independent with probabilities

(6.36) $p_i(x \mid \theta_i) = c_i(\theta_i)\, h_i(x)\, \theta_i^x, \quad x = 0, 1, \ldots,$

that is, $p_i(x \mid \theta_i)$ is in the exponential family. Then, for any real-valued function g(x) with $E_\theta|g(X)| < \infty$, and any number m for which g(x) = 0 when $x_i < m$,

(6.37) $E_\theta\left[\theta_i^m g(X)\right] = E_\theta\left[g(X - m e_i)\, \frac{h_i(X_i - m)}{h_i(X_i)}\right],$

where $e_i$ is the unit vector with ith coordinate equal to 1 and the rest equal to 0.


The principal application of Lemma 6.9 is to find an unbiased estimator of the 
risk of estimators of the form X + g(X), analogous to that of Corollary 4.7.2. 


Theorem 6.10 Let $X_1, \ldots, X_r$ be independently distributed according to (6.36), and let $\delta^0(x) = \{h_i(x_i - 1)/h_i(x_i)\}$ [the estimator whose ith coordinate is $h_i(x_i - 1)/h_i(x_i)$] be the UMVU estimator of $\theta$. For the loss function

(6.38) $L_m(\theta, \delta) = \sum_{i=1}^r \theta_i^{m_i}(\theta_i - \delta_i)^2,$

where $m = (m_1, \ldots, m_r)$ are known numbers, the risk of the estimator $\delta(x) = \delta^0(x) + g(x)$ is given by

(6.39) $R(\theta, \delta) = R(\theta, \delta^0) + E_\theta \mathcal{D}(X)$

with

(6.40) $\mathcal{D}(x) = \sum_{i=1}^r \left\{\frac{2h_i(x_i - m_i - 1)}{h_i(x_i)}\left[g_i(x - m_i e_i) - g_i(x - m_i e_i - e_i)\right] + \frac{h_i(x_i - m_i)}{h_i(x_i)}\, g_i^2(x - m_i e_i)\right\}.$

Proof. Write

$R(\theta, \delta) = R(\theta, \delta^0) - 2E_\theta\left\{\sum_{i=1}^r \theta_i^{m_i} g_i(X)\left(\theta_i - \frac{h_i(X_i - 1)}{h_i(X_i)}\right)\right\} + E_\theta\left\{\sum_{i=1}^r \theta_i^{m_i} g_i^2(X)\right\}$

and apply Lemma 6.9 (see Problem 6.17). □

Hwang (1982a) established general conditions on g(x) for which $\mathcal{D}(x) \le 0$, leading to improved estimators of $\theta$ (see also Ghosh et al. 1983, Tsui 1979a, 1979b, and Hwang 1982b). We will only look at some examples.

Example 6.11 Improved estimation for independent Poissons. The Clevenson-Zidek estimator (6.31) dominates X (and is minimax) for the loss $L_{-1}(\theta, \delta)$ of (6.38); however, Theorem 6.8 does not cover squared error loss, $L_0(\theta, \delta)$. For this loss, if $X_i \sim \text{Poisson}(\lambda_i)$, independent, the risk of an estimator $\delta(x) = x + g(x)$ is given by (6.39) with $\delta^0 = x$ and

(6.41) $\mathcal{D}(x) = \sum_{i=1}^r \left\{2x_i\left[g_i(x) - g_i(x - e_i)\right] + g_i^2(x)\right\}.$

The estimator with

(6.42) $g_i(x) = -\frac{c(x)\,k(x_i)}{\sum_{j=1}^r k(x_j)\,k(x_j + 1)}, \qquad k(x) = \sum_{l=1}^{x} \frac{1}{l},$

and c(x) nondecreasing in each coordinate with

$0 \le c(x) \le 2\left[\#(x_i\text{'s} > 1) - 2\right]$

satisfies $\mathcal{D}(x) \le 0$ and hence dominates x under $L_0$. (The notation $\#(a_i\text{'s} > b)$ denotes the number of $a_i$'s that are greater than b.)
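The following sketch illustrates, under explicit assumptions, a Poisson shrinkage estimator x + g(x) built from the harmonic sums $k(x) = \sum_{l=1}^x 1/l$, together with the unbiased estimate of the risk difference $\sum_i \{2x_i[g_i(x) - g_i(x - e_i)] + g_i^2(x)\}$ for squared error loss. It assumes that g shrinks toward zero, $g_i(x) = -c(x)k(x_i)/\sum_j k(x_j)k(x_j+1)$, and it uses the particular choice $c(x) = \max\{\#(x_i > 1) - 2,\, 0\}$; both the sign convention and this c are assumptions of the sketch, and all function names are hypothetical.

```python
def k(n):
    """Harmonic sum 1 + 1/2 + ... + 1/n (zero for n <= 0)."""
    return sum(1.0 / l for l in range(1, n + 1))


def c_default(x):
    """Assumed choice of c: number of coordinates exceeding 1, minus 2,
    truncated at zero. It is nondecreasing in each coordinate."""
    return float(max(sum(1 for v in x if v > 1) - 2, 0))


def g(x, c=c_default):
    """Shrinkage term g_i(x) = -c(x) k(x_i) / sum_j k(x_j) k(x_j + 1)."""
    denom = sum(k(v) * k(v + 1) for v in x)
    if denom == 0.0:
        return [0.0] * len(x)
    cx = c(x)
    return [-cx * k(v) / denom for v in x]


def estimator(x, c=c_default):
    """The estimator x + g(x)."""
    return [v + gi for v, gi in zip(x, g(x, c))]


def risk_difference_estimate(x, c=c_default):
    """sum_i {2 x_i [g_i(x) - g_i(x - e_i)] + g_i(x)^2}; terms with
    x_i = 0 contribute only g_i(x)^2 because of the factor x_i."""
    gx = g(x, c)
    total = 0.0
    for i, xi in enumerate(x):
        if xi > 0:
            shifted = list(x)
            shifted[i] -= 1
            total += 2 * xi * (gx[i] - g(shifted, c)[i])
        total += gx[i] ** 2
    return total
```

A zero count is left at zero because k(0) = 0, so the estimate stays in the parameter space, and a negative value of `risk_difference_estimate` at a given x indicates an estimated risk reduction relative to x itself.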

For the loss function $L_{-1}(\theta, \delta)$, the situation is somewhat easier, and the estimator x + g(x), with


(6.43) 


gi(x) = c(x - e, ) 


Em- 


where c(·) is nondecreasing with $0 \le c(\cdot) \le 2(r - 1)$, will satisfy $\mathcal{D}(x) \le 0$ and, hence, is minimax for $L_{-1}$. Note that (6.43) includes the Clevenson-Zidek estimator as a special case. (See Problem 6.18.) ||


As might be expected, these improved estimators, which shrink toward 0, perform
best and give the greatest risk improvement, when the $\theta_i$'s are close to zero
and, more generally, when they are close together. Numerical studies (Clevenson
and Zidek 1975, Tsui 1979a, 1979b, Hudson and Tsui 1981) quantify this
improvement, which can be substantial. Other estimators, which shrink toward other
targets in the parameter space, can optimize the region of greatest risk reduction 
(see, for example, Ghosh et al. 1983, Hudson 1985). 

Just as the minimaxity of Stein estimators carried over from the normal distribution to mixtures of normals, minimaxity carries over from the Poisson to mixtures of Poissons, for example, the negative binomial distribution (see Example 4.6.6).

Example 6.12 Improved negative binomial estimation. For $X_1, \ldots, X_r$ independent negative binomial random variables with distribution

(6.44) $p_i(x \mid \theta_i) = \binom{t_i + x - 1}{x}\, \theta_i^x (1 - \theta_i)^{t_i}, \quad x = 0, 1, \ldots,$

the UMVU estimator of $\theta_i$ is $\delta_i^0(x_i) = x_i/(x_i + t_i - 1)$ (where $\delta_i^0(0) = 0$ for $t_i = 1$). Using Theorem 6.10, this estimator can be improved. For example, for the loss $L_{-1}(\theta, \delta)$ of (6.38), the estimator $\delta^0(x) + g(x)$, with


(6.45) 


00 = 


c(x - e,)x, 
— *•'"/ + V/» 


and c(·) nondecreasing with $0 \le c \le 2(r - 2)/\min_i\{t_i\}$, satisfies $\mathcal{D}(x) \le 0$ and, hence, has uniformly smaller risk than $\delta^0(x)$. Similar results can be obtained for other loss functions (see Problem 6.19). Surprisingly, however, similar domination results do not hold for the MLE $\hat{\theta}_i = x_i/(x_i + t_i)$. Chow (1990) has shown that the MLE is admissible in all dimensions (see also Example 7.14). ||


Finally, we turn to a situation where the Stein effect fails to yield improved 
estimators. 

Example 6.13 Multivariate binomial. For $X_i \sim b(\theta_i, n_i)$, i = 1, ..., r, independent, that is, with distribution

(6.46) $p_i(x \mid \theta_i) = \binom{n_i}{x}\, \theta_i^x (1 - \theta_i)^{n_i - x}, \quad x = 0, 1, \ldots, n_i,$

it seems reasonable to expect that estimators of the form x + g(x) exist that dominate the UMVU estimator x. This expectation is partially based on the fact that (6.46) is a discrete exponential family. However, Johnson (1971) showed that such estimators do not exist in the binomial problem for squared error loss (see Example 7.23 and Problem 7.28).

Theorem 6.14 If $k_i(\theta_i)$, i = 1, ..., r, are continuous functions and $\delta_i(x_i)$ is an admissible estimator of $k_i(\theta_i)$ under squared error loss, then $(\delta_1(x_1), \ldots, \delta_r(x_r))$ is an admissible estimator of $(k_1(\theta_1), \ldots, k_r(\theta_r))$ under sum-of-squared-error loss.

Thus, there is no "Stein effect" in the binomial problem. In particular, as $X_i$ is an admissible estimator of $\theta_i$ under squared error loss (Example 2.16), X is an admissible estimator of $\theta$. ||


It turns out that the absence of the Stein effect is not a property of the binomial 
distribution, but rather a result of the finiteness of the sample space (Gutmann 
1982a; see also Brown 1981). See Note 9.7 for further discussion. 


7 Admissibility and Complete Classes 

In Section 1.7, we defined the admissibility of an estimator which can be formally 
stated as follows. 

Definition 7.1 An estimator $\delta = \delta(X)$ of $\theta$ is admissible [with respect to the loss function $L(\theta, \delta)$] if there exists no estimator $\delta'$ that satisfies

(7.1) (i) $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$,

(ii) $R(\theta, \delta') < R(\theta, \delta)$ for some $\theta$,

where $R(\theta, \delta) = E_\theta L(\theta, \delta)$. If such an estimator $\delta'$ exists, then $\delta$ is inadmissible. When a pair of estimators $\delta$ and $\delta'$ satisfy (7.1), $\delta'$ is said to dominate $\delta$.
Although admissibility is a desirable property, it is a very weak requirement.
This is illustrated by Example 2.23, where an admissible estimator was completely 
unreasonable since it used no information from the relevant distribution. Here is 
another example (from Makani 1977, who credits it to L.D. Brown). 

Example 7.2 Unreasonable admissible estimator. Let $X_1$ and $X_2$ be independent random variables, $X_i$ distributed as $N(\theta_i, 1)$, and consider the estimation of $\theta_1$ with loss function $L((\theta_1, \theta_2), \delta) = (\theta_1 - \delta)^2$. Then, $\delta = \text{sign}(X_2)$ is an admissible estimator of $\theta_1$, although its distribution does not depend on $\theta_1$. The result is established by showing that $\delta$ cannot be simultaneously beaten at $(\theta_1, \theta_2) = (1, \theta_2)$ and $(-1, \theta_2)$. (See Problem 7.1.) ||


Conversely, there exist inadmissible decision rules that perform quite well. 

Example 7.3 The positive-part Stein estimator. For $X \sim N_r(\theta, I)$, the positive-part Stein estimator

(7.2) $\delta^+(x) = \left(1 - \frac{r-2}{|x|^2}\right)^+ x$

is a good estimator of $\theta$ under squared error loss, being both difficult to improve upon and difficult to dominate. However, as was pointed out by Baranchik (1964), it is not admissible. (This follows from Theorem 7.17, as $\delta^+$ is not smooth enough to be a generalized Bayes estimator.) Thus, there exists an estimator that uniformly dominates it.

How much better can such a dominating estimator be? Efron and Morris (1973a, Section 5) show that $\delta^+$ is "close" to a Bayes rule (Problem 7.2). Brown (1988; see also Moore and Brook 1978), writing

(7.3) $R(\theta, \delta^+) = E_\theta\left[\frac{1}{r}\sum_{i=1}^r \left(\theta_i - \delta_i^+(X)\right)^2\right] = E_\theta \mathcal{D}_{\delta^+}(X),$

where

$\mathcal{D}_{\delta^+}(x) = 1 + \frac{m^2(x)|x|^2}{r} - \frac{2}{r}\left\{(r-2)\,m(x) + 2I[m(x) = 1]\right\}$

with $m(x) = \min\{1, c(r-2)/|x|^2\}$ (see Corollary 4.7.2), proves that no estimator $\delta$ exists for which $\mathcal{D}_\delta(x) \le \mathcal{D}_{\delta^+}(x)$ for all x. These observations imply that the inadmissible $\delta^+$ behaves similarly to a Bayes rule and has a risk that is close to that of an admissible estimator. ||
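To make the quantities in this example concrete, the sketch below computes the positive-part estimator $(1 - m(x))\,x$ with $m(x) = \min\{1, (r-2)/|x|^2\}$ and the unbiased risk estimate $1 + m^2(x)|x|^2/r - (2/r)\{(r-2)m(x) + 2I[m(x) = 1]\}$. It takes c = 1 in m(x), treats x = 0 as the case m(x) = 1, and uses hypothetical function names; these are assumptions of the illustration.

```python
def _m(x):
    """m(x) = min{1, (r - 2) / |x|^2}, with m = 1 at the origin (assumed)."""
    r = len(x)
    n2 = sum(v * v for v in x)
    if n2 == 0.0:
        return 1.0
    return min(1.0, (r - 2) / n2)


def positive_part_stein(x):
    """Positive-part Stein estimator (1 - (r - 2)/|x|^2)^+ x."""
    m = _m(x)
    return [(1.0 - m) * v for v in x]


def unbiased_risk_estimate(x):
    """1 + m^2 |x|^2 / r - (2 / r) {(r - 2) m + 2 I[m = 1]}."""
    r = len(x)
    n2 = sum(v * v for v in x)
    m = _m(x)
    indicator = 1.0 if m == 1.0 else 0.0
    return 1.0 + m * m * n2 / r - (2.0 / r) * ((r - 2) * m + 2.0 * indicator)
```

For r = 5 and $|x|^2 = 12$ the estimator multiplies x by 0.75 and the risk estimate is 0.85; when $|x|^2 < r - 2$ the estimator is zero and the risk estimate reduces to $(|x|^2 - r)/r$, which can be negative even though the risk itself is not.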


However, since admissibility generally is a desirable property, it is of interest to 
determine the totality of admissible estimators. 

Definition 7.4 A class C of estimators is complete if for any $\delta$ not in C there exists an estimator $\delta'$ in C such that (7.1) holds; C is essentially complete if for any $\delta$ not in C there exists an estimator $\delta'$ in C such that (7.1)(i) holds.

It follows from this definition that any estimator outside a complete class is inadmissible. If C is essentially complete, an estimator $\delta$ outside of C may be admissible, but there will then exist an estimator $\delta'$ in C with the same risk function.

It is therefore reasonable, in the search for an optimal estimator, to restrict
attention to a complete or essentially complete class. The following result provides 
two examples of such classes. 

Lemma 7.5 

(i) If $C$ is the class of all (including randomized) estimators based on a sufficient
statistic, then $C$ is essentially complete.

(ii) If the loss function $L(\theta, d)$ is convex in $d$, then the class of nonrandomized
estimators is complete.

Proof. These results are immediate consequences of Theorem 1.6.1 and Corollary 
1.7.9. □ 


Although a complete class contains all admissible estimators, it may also contain 
many inadmissible ones. (This is, for example, the case for the two complete classes 
of Lemma 7.5.) A complete class is most useful if it is as small as possible. 

Definition 7.6 A complete class C of estimators is minimal complete if no proper 
subset of C is complete. 

Lemma 7.7 If a minimal complete class C exists, then it is exactly the class of all 
admissible estimators. 


Proof. It is clear that $C$ contains all admissible rules, so we only need to prove
that it cannot contain any inadmissible ones. Let $\delta \in C$ and suppose that $\delta$ is
inadmissible. Then, there is a $\delta' \in C$ that dominates it, and, hence, the class $C \setminus \{\delta\}$
($C$ with the estimator $\delta$ removed) is a complete class. This contradicts the fact that
$C$ is minimal complete. □

Note that Lemma 7.7 requires the existence of a minimal complete class. The 
following example illustrates the possibility that a minimal complete class may 
not exist. (For another example, see Blackwell and Girshick 1954, Problem 5.2.1.) 


Example 7.8 Nonexistence of a minimal complete class. Let $X$ be normally
distributed as $N(\theta, 1)$ and consider the problem of estimating $\theta$ with loss function

$$L(\theta, d) = \begin{cases} d - \theta & \text{if } \theta < d \\ 0 & \text{if } \theta \ge d. \end{cases}$$

Then, if $\delta(x) \le \delta'(x)$ for all $x$, we have $R(\theta, \delta) \le R(\theta, \delta')$ with strict inequality if
$P_\theta[\delta(X) < \delta'(X)] > 0$.

Many complete classes exist in this situation. For example, if $\delta_0$ is any estimator
of $\theta$, then the class of all estimators with $\delta(x) < \delta_0(x)$ for some $x$ is complete
(Problem 7.4). We shall now show that there exists no minimal complete class.
Suppose $C$ is minimal complete and $\delta_0$ is any member of $C$. Then, some estimator
$\delta_1$ dominating $\delta_0$ must also lie in $C$. If not, there would be no members of $C$ left to
dominate such estimators and $C$ would not be complete. On the other hand, if $\delta_1$
dominates $\delta_0$, and $\delta_1$ and $\delta_0$ are both in $C$, the class $C$ is not minimal since $\delta_0$ could
be removed without disturbing completeness. ||

Despite this example, the minimal complete class typically coincides with the
class of admissible estimators, and the search for a minimal complete class is therefore
equivalent to the determination of all admissible estimators. The following
results are concerned with these two related aspects, admissibility and completeness,
and with the relation of both to Bayes estimators.

Theorem 2.4 showed that any unique Bayes estimator is admissible. The following
result replaces the uniqueness assumption by some other conditions.

Theorem 7.9 For a possibly vector-valued parameter $\theta$, suppose that $\delta^\pi$ is a
Bayes estimator having finite Bayes risk with respect to a prior density $\pi$ which
is positive for all $\theta$, and that the risk function of every estimator $\delta$ is a continuous
function of $\theta$. Then, $\delta^\pi$ is admissible.

Proof. If $\delta^\pi$ is not admissible, there exists an estimator $\delta$ such that

$$R(\theta, \delta) \le R(\theta, \delta^\pi) \quad \text{for all } \theta$$

and

$$R(\theta, \delta) < R(\theta, \delta^\pi) \quad \text{for some } \theta.$$

It then follows from the continuity of the risk functions that $R(\theta, \delta) < R(\theta, \delta^\pi)$
for all $\theta$ in some open subset $\Omega_0$ of the parameter space and hence that

$$\int R(\theta, \delta)\pi(\theta)\,d\theta < \int R(\theta, \delta^\pi)\pi(\theta)\,d\theta,$$

which contradicts the definition of $\delta^\pi$. □

A basic assumption in this theorem is the continuity of all risk functions. The
following example provides an important class of situations for which this assumption
holds.

Example 7.10 Exponential families have continuous risks. Suppose that we let
$p(x|\eta)$ be the exponential family of (5.2). Then, it follows from Theorem 1.5.8
that for any loss function $L(\eta, \delta)$ for which $R(\eta, \delta) = E_\eta L(\eta, \delta)$ is finite, $R(\eta, \delta)$
is continuous. (See Problem 7.6.) ||

There are many characterizations of problems in which all risk functions are 
continuous. With assumptions on both the loss function and the density, theorems 
can be established to assert the continuity of risks. (See Problem 7.7 for a set of 
conditions involving boundedness of the loss function.) The following theorem, 
which we present without proof, is based on a set of assumptions that are often 
satisfied in practice. 

Theorem 7.11 Consider the estimation of $\theta$ with loss $L(\theta, \delta)$, where $X \sim f(x|\theta)$
has monotone likelihood ratio and is continuous in $\theta$ for each $x$. If the loss function
$L(\theta, \delta)$ satisfies

(i) $L(\theta, \delta)$ is continuous in $\theta$ for each $\delta$,

(ii) $L$ is decreasing in $\delta$ for $\delta < \theta$ and increasing in $\delta$ for $\delta > \theta$,

(iii) there exist functions $a$ and $b$, which are bounded on all bounded subsets of
the parameter space, such that for all $\delta$

$$L(\theta, \delta) \le a(\theta, \theta')L(\theta', \delta) + b(\theta, \theta'),$$

then the estimators with finite-valued, continuous risk functions $R(\theta, \delta) = E_\theta L(\theta, \delta)$
form a complete class.

Theorems similar to Theorem 7.11 can be found in Ferguson (1967, Section 3.7),
Brown (1986a), Berk, Brown, and Cohen (1981), Berger (1985, Section 8.8), or Robert
(1994a, Section 6.2.1). Also see Problem 7.9 for another version of this theorem.

The assumptions on the loss are relatively simple to check. In fact, assumptions 
(i) and (ii) are almost self-evident, whereas (iii) will be satisfied by most interesting 
loss functions. 

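Example 7.12 Squared error loss. For $L(\theta, \delta) = (\theta - \delta)^2$, we have

(7.4)    $$(\theta - \delta)^2 = (\theta - \theta' + \theta' - \delta)^2 = (\theta - \theta')^2 + 2(\theta - \theta')(\theta' - \delta) + (\theta' - \delta)^2.$$

Now, since $2xy \le x^2 + y^2$,

$$2(\theta - \theta')(\theta' - \delta) \le (\theta - \theta')^2 + (\theta' - \delta)^2$$

and, hence,

$$(\theta - \delta)^2 \le 2(\theta - \theta')^2 + 2(\theta' - \delta)^2,$$

so condition (iii) is satisfied with $a(\theta, \theta') = 2$ and $b(\theta, \theta') = 2(\theta - \theta')^2$. ||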
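The bound in condition (iii) for squared error loss can be spot-checked numerically. The sketch below is an illustration only; the helper names, the sampling range, and the tolerance are assumptions made here, and a random check of this kind does not replace the algebraic argument.

```python
import random


def squared_error(theta, d):
    """Squared error loss L(theta, d) = (theta - d)^2."""
    return (theta - d) ** 2


def condition_iii_holds(theta, theta_prime, d, tol=1e-9):
    """Check L(theta, d) <= a * L(theta', d) + b with a = 2 and b = 2 (theta - theta')^2."""
    a = 2.0
    b = 2.0 * (theta - theta_prime) ** 2
    return squared_error(theta, d) <= a * squared_error(theta_prime, d) + b + tol


# Spot-check the inequality on randomly drawn triples (theta, theta', d).
rng = random.Random(1)
all_hold = all(
    condition_iii_holds(rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(-10, 10))
    for _ in range(10000)
)
```

The triple $\theta = 1$, $\theta' = 0$, $\delta = -1$ attains equality, which shows that the constant 2 multiplying $L(\theta', \delta)$ cannot be lowered while keeping $b = 2(\theta - \theta')^2$.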


Since most problems that we will be interested in will satisfy the conditions of 
Theorem 7.11, we now only need consider estimators with finite-valued continuous 
risks. Restriction to continuous risk, in turn, allows us to utilize the method of 
proving admissibility that we previously saw in Example 2.8. (But note that this 
restriction can be relaxed somewhat; see Gajek 1983.) The following theorem 
extends the admissibility of Bayes estimators to sequences of Bayes estimators. 

Theorem 7.13 (Blyth’s Method) Suppose that the parameter space $\Omega \subset \Re^r$ is
open, and estimators with continuous risk functions form a complete class. Let
$\delta$ be an estimator with a continuous risk function, and let $\{\pi_n\}$ be a sequence of
(possibly improper) prior measures such that

(a) $r(\pi_n, \delta) < \infty$ for all $n$,

(b) for any nonempty open set $\Omega_0 \subset \Omega$, there exist constants $B > 0$ and $N$ such
that

$$\int_{\Omega_0} \pi_n(\theta)\,d\theta \ge B \quad \text{for all } n \ge N,$$

(c) $r(\pi_n, \delta) - r(\pi_n, \delta^{\pi_n}) \to 0$ as $n \to \infty$.

Then, $\delta$ is an admissible estimator.

Proof. Suppose $\delta$ is inadmissible, so that there exists $\delta'$ with $R(\theta, \delta') \le R(\theta, \delta)$,
with strict inequality for some $\theta$. By the continuity of the risk functions, this implies
that there exists a nonempty open set $\Omega_0$ and $\varepsilon > 0$ such that $R(\theta, \delta) - R(\theta, \delta') > \varepsilon$
for $\theta \in \Omega_0$. Hence, for all $n \ge N$,

(7.5)    $$r(\pi_n, \delta) - r(\pi_n, \delta') > \varepsilon \int_{\Omega_0} \pi_n(\theta)\,d\theta \ge \varepsilon B,$$

and therefore, since $r(\pi_n, \delta^{\pi_n}) \le r(\pi_n, \delta')$, (c) cannot hold. □

Note that condition (b) prevents the possibility that as $n \to \infty$, all the mass
of $\pi_n$ escapes to $\infty$. This is similar to the requirement of tightness of a family
of measures (see Chung 1974, Section 4.4, or Billingsley 1995, Section 25). [It
is possible to combine conditions (b) and (c) into one condition involving a ratio
(Problem 7.12) which is how Blyth’s method was applied in Example 2.8.]


Example 7.14 Admissible negative binomial MLE. As we stated in Example
6.12 (but did not prove), the MLE of a negative binomial success probability is
admissible under squared error loss. We can now prove this result using Theorem
7.13.

Let $X$ have the negative binomial distribution

(7.6)    $$p(x|\theta) = \binom{r+x-1}{x}\theta^x(1-\theta)^r, \quad 0 < \theta < 1.$$

The ML estimator of $\theta$ is $\delta^0(x) = x/(x + r)$.

To use Blyth’s method, we need a sequence of priors $\pi$ for which the Bayes risks
$r(\pi, \delta^\pi)$ get close to the Bayes risk of $\delta^0$. When $\theta$ has the beta prior $\pi = B(a, b)$,
the Bayes estimator is the posterior mean $\delta^\pi = (x + a)/(x + r + a + b)$. Since
$\delta^\pi(x) \to \delta^0(x)$ as $a, b \to 0$, it is reasonable to try a sequence of priors $B(a, b)$
with $a, b \to 0$.

It is straightforward to calculate the posterior expected losses

(7.7)    $$E\{[\delta^\pi(x) - \theta]^2 \mid x\} = \frac{(x+a)(r+b)}{(x+r+a+b)^2(x+r+a+b+1)},$$

         $$E\{[\delta^0(x) - \theta]^2 \mid x\} = \frac{(bx-ar)^2}{(x+r)^2(x+r+a+b)^2} + E\{[\delta^\pi(x) - \theta]^2 \mid x\},$$

and hence the difference is

(7.8)    $$D(x) = \frac{(bx-ar)^2}{(x+r)^2(x+r+a+b)^2}.$$

Before proceeding further, we must check that the priors satisfy condition (b) of
Theorem 7.13. [The normalized priors will not, since, for example, the probability
of the interval $(\varepsilon, 1 - \varepsilon)$ under $B(a, b)$ tends to zero as $a, b \to 0$.] Since we are
letting $a, b \to 0$, we only need consider $0 < a, b < 1$. We then have for any
$0 < \varepsilon < \varepsilon' < 1$,


(7.9) 


f 


e a ~\\ - ey- 1 de 


b -1 


f 


3 - 1 / 


(1 -9)~ l d9 = log 



satisfying condition (b). 

To compute the Bayes risk, we next need the marginal distribution of $X$, which
is given by

(7.10)    $$P(X = x) = \frac{\Gamma(r+x)}{\Gamma(x+1)\Gamma(r)}\,\frac{\Gamma(r+b)\Gamma(x+a)}{\Gamma(r+x+a+b)}\,\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)},$$

the beta-Pascal distribution. Hence, the difference in Bayes risks, using the
unnormalized priors, is

(7.11)    $$\sum_{x=0}^{\infty} \frac{\Gamma(r+x)}{\Gamma(x+1)\Gamma(r)}\,\frac{\Gamma(r+b)\Gamma(x+a)}{\Gamma(r+x+a+b)}\,D(x),$$

which we must show goes to 0 as $a, b \to 0$.

Note first that the $x = 0$ term in (7.11) is

$$\frac{\Gamma(r+b)\Gamma(a+1)}{\Gamma(r+a+b)}\,\frac{a}{(r+a+b)^2} \to 0$$

as $a \to 0$. Also, for $x \ge 1$ and $a \le 1$, $\frac{\Gamma(x+a)}{\Gamma(x+1)}\,\frac{\Gamma(r+x)}{\Gamma(r+x+a+b)} \le 1$, so it is sufficient to
show

$$\sum_{x=1}^{\infty} D(x) \to 0 \quad \text{as } a, b \to 0.$$

From (7.8), using the facts that

$$\sup_{x \ge 0} \frac{(bx-ar)^2}{(x+r)^2} = \max\{a^2, b^2\} \quad \text{and} \quad \frac{1}{(x+r+a+b)^2} \le \frac{1}{x^2},$$

we have

$$\sum_{x=1}^{\infty} D(x) \le \max\{a^2, b^2\} \sum_{x=1}^{\infty} \frac{1}{x^2} \to 0$$

as $a, b \to 0$, establishing the admissibility of the ML estimator of $\theta$.
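The two requirements checked in this example can be explored numerically. The following Python sketch is illustrative only: it assumes the unnormalized beta kernel $\theta^{a-1}(1-\theta)^{b-1}$ as the prior, truncates the infinite sum in the Bayes risk difference after a fixed number of terms, and uses helper names and numerical settings that are assumptions made here, not part of the text.

```python
import math


def posterior_loss_gap(x, r, a, b):
    """D(x) of (7.8): excess posterior expected loss of x/(x+r) over the Bayes rule."""
    return (b * x - a * r) ** 2 / ((x + r) ** 2 * (x + r + a + b) ** 2)


def log_weight(x, r, a, b):
    """Log of the unnormalized marginal weight of x (beta normalizing constant omitted)."""
    return (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
            + math.lgamma(r + b) + math.lgamma(x + a)
            - math.lgamma(r + x + a + b))


def bayes_risk_gap(r, a, b, n_terms=20000):
    """Truncated sum over x of weight(x) * D(x); the neglected tail is small for large n_terms."""
    return sum(math.exp(log_weight(x, r, a, b)) * posterior_loss_gap(x, r, a, b)
               for x in range(n_terms))


def unnormalized_mass(a, b, lo, hi, n_grid=2000):
    """Midpoint-rule integral of theta^(a-1) (1-theta)^(b-1) over [lo, hi], 0 < lo < hi < 1."""
    step = (hi - lo) / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = lo + (i + 0.5) * step
        total += theta ** (a - 1) * (1.0 - theta) ** (b - 1)
    return step * total


# Bayes risk gap for r = 2 along a sequence a = b -> 0.
gaps = [bayes_risk_gap(r=2, a=eps, b=eps) for eps in (0.1, 0.01, 0.001)]
```

For $0 < a, b < 1$ the kernel is at least 1 on $(0, 1)$, so `unnormalized_mass(a, b, lo, hi)` is at least `hi - lo`, which is the kind of lower bound condition (b) asks for, while the entries of `gaps` shrink as $a = b$ decreases.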


Theorem 7.13 shows that one of the sufficient conditions for an estimator to be 
admissible is that its Bayes risk is approachable by a sequence of Bayes risks of 
Bayes estimators. It would be convenient if it were possible to replace the risks by 
the estimators themselves. That this is not the case can be seen from the fact that 
the normal sample mean in three or more dimensions is not admissible although 
it is the limit of Bayes estimators. 

However, under certain conditions the converse is true: every admissible
estimator is a limit of Bayes estimators. 4 We present, but do not prove, the following
necessary conditions for admissibility. (This is essentially Theorem 4A.12 of
Brown (1986a); see his Appendix to Chapter 4 for a detailed proof.)

Theorem 7.15 Let $X \sim f(x|\theta)$ be a density relative to a $\sigma$-finite measure $\nu$, such
that $f(x|\theta) > 0$ for all $x \in \mathcal{X}$, $\theta \in \Omega$. Let the loss function $L(\theta, \delta)$ be continuous,
strictly convex in $\delta$ for every $\theta$, and satisfy

$$\lim_{|\delta| \to \infty} L(\theta, \delta) = \infty \quad \text{for all } \theta \in \Omega.$$

Then, to every admissible procedure $\delta(x)$ there corresponds a sequence $\pi_n$ of prior
distributions with support on a finite set (and hence with finite Bayes risk) for which

(7.12)    $$\delta^{\pi_n}(x) \to \delta(x) \quad \text{a.e. } (\nu),$$

where $\delta^{\pi_n}$ is the Bayes estimator for $\pi_n$.

4 The remaining material of this section is of a somewhat more advanced nature. It is sketched here
to give the reader some idea of these developments and to serve as an introduction to the literature.

As an immediate corollary to Theorem 7.15, we have the following complete
class theorem.

Corollary 7.16 Under the assumptions of Theorem 7.15, the class of all estimators
$\delta(x)$ that satisfy (7.12) is complete.

For exponential families, the assumptions of Theorem 7.15 are trivially satisfied,
so limits of Bayes estimators are a complete class. More importantly, if $X$ has a
density in the $r$-variate exponential family, and if $\delta^\pi$ is a limit of Bayes estimators
$\delta^{\pi_n}$, then a subsequence of measures $\{\pi_{n'}\}$ can be found such that $\pi_{n'} \to \pi$ and $\delta^\pi$
is generalized Bayes against $\pi$. Such a result was originally developed by Sacks
(1963) and extended by Brown (1971) and Berger and Srinivasan (1978) to the
following theorem.

Theorem 7.17 Under the assumptions of Theorem 7.15, if the densities of $X$
constitute an $r$-variate exponential family, then any admissible estimator is a
generalized Bayes estimator. Thus, the generalized Bayes estimators form a complete
class.

Further characterizations of generalized Bayes estimators were given by
Strawderman and Cohen (1971) and Berger and Srinivasan (1978). See Berger 1985
for more details. Note that it is not the case that all generalized Bayes estimators
are admissible. Farrell (1964) gave examples of inadmissible generalized Bayes
estimators in location problems, in particular $X \sim N(\theta, 1)$, $\pi(\theta) = e^\theta$. (See also
Problem 4.2.15.) Thus, it is of interest to determine conditions under which
generalized Bayes estimators are admissible. We do so in the following examples,
where we look at a number of characterizations of admissible estimators in
specific situations. Although these characterizations have all been derived using the
tools (or their generalizations) that have been described here, in some cases the
exact derivations are complex.

We begin with a fundamental identity. 

Example 7.18 Brown’s identity. In order to understand what types of estimators
are admissible, it would be helpful if the convergence of risk functions in Blyth’s
method were more explicitly dependent on the convergence of the estimators.
Brown (1971) gave an identity that makes this connection clearer.

Let $X \sim N_r(\theta, I)$ and $L(\theta, \delta) = |\theta - \delta|^2$, and for a given prior $\pi(\theta)$, let
$\delta^\pi(x) = x + \nabla \log m_\pi(x)$ be the Bayes estimator, where $m_\pi(x) = \int_\Omega f(x|\theta)\pi(\theta)\,d\theta$
is the marginal density. Suppose that $\delta^g(x) = x + \nabla \log m_g(x)$ is another estimator.
Since $\delta^\pi$ minimizes the Bayes risk, the difference $r(\pi, \delta^g) - r(\pi, \delta^\pi)$ is nonnegative.
First note that

(7.13)    $$r(\pi, \delta^g) - r(\pi, \delta^\pi) = E\,|\delta^\pi(X) - \delta^g(X)|^2$$

(see Problem 7.16), where the expectation is taken under the marginal $m_\pi$; hence,
we have the identity

(7.14)    $$r(\pi, \delta^g) - r(\pi, \delta^\pi) = \int |\nabla \log m_\pi(x) - \nabla \log m_g(x)|^2\, m_\pi(x)\,dx.$$

We now have the estimator explicitly in the integral, but we must develop (7.14)
a bit further to be more useful in helping to decide admissibility. Two paths have
been taken. On the first, we note that if we were going to use (7.14) to establish the
admissibility of $\delta^g$, we might replace the prior $\pi(\cdot)$ with a sequence $\pi_n(\cdot)$. However,
it would be more useful to have the measure of integration not depend on $n$ [since
$m_\pi(x)$ would now equal $m_{\pi_n}(x)$]. To this end, write $k_n(x) = m_{\pi_n}(x)/m_g(x)$, and

(7.15)    $$r(\pi_n, \delta^g) - r(\pi_n, \delta^{\pi_n}) = \int |\nabla \log (m_{\pi_n}(x)/m_g(x))|^2\, m_{\pi_n}(x)\,dx = \int \frac{|\nabla k_n(x)|^2}{k_n(x)}\, m_g(x)\,dx,$$

where the second equality follows from differentiation (see Problem 7.16), and now
the integration measure does not depend on $n$. Thus, if we could apply Lebesgue’s
dominated convergence, then $|\nabla k_n(x)|^2/k_n(x) \to 0$ would imply the admissibility of $\delta^g$.
This is the path taken by Brown (1971), who established a relationship between
(7.15) and the behavior of a diffusion process in $r$ dimensions, and then gave
necessary and sufficient conditions for the admissibility of $\delta^g$. For example, the
admissibility of the sample mean in one and two dimensions is linked to the
recurrence of a random walk in one and two dimensions, and the inadmissibility
is linked to its transience in three or more dimensions. This is an interesting and
fruitful approach, but to pursue it fully requires the development of properties of
diffusions, which we will not do here. [Johnstone 1984 (see also Brown and Farrell
1985) developed similar theorems for the Poisson distribution (Problem 7.25), and
Eaton (1992) investigated another related stochastic process; the review paper by
Rukhin (1995) provides an excellent entry into the mathematics of this literature.]

Another path, developed in Brown and Hwang (1982), starts with the estimator
$\delta^g$ and constructs a sequence $g_n \to g$ that leads to a simplified condition for the
convergence of (7.14) to zero, and uses Blyth’s method to establish admissibility.
Although they prove their theorem for exponential families, we shall only state it
here for the normal distribution. (See Problem 7.19 for a more general statement.)
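Brown's identity relating the Bayes risk difference to $E|\delta^\pi(X) - \delta^g(X)|^2$ can be checked by hand in the simplest conjugate case. The worked computation below is an illustration and is not taken from the text; it assumes $r = 1$, the proper prior $\pi = N(0, \tau^2)$, and $g \equiv 1$, so that $\delta^g(x) = x$.

```latex
% Illustrative assumptions: r = 1, \pi = N(0, \tau^2), g \equiv 1 (so \delta^g(x) = x).
% Under \pi the marginal of X is N(0, 1 + \tau^2).
\begin{align*}
m_\pi(x) &\propto \exp\!\left\{-\frac{x^2}{2(1+\tau^2)}\right\}, &
\delta^\pi(x) &= x + \frac{d}{dx}\log m_\pi(x) = \frac{\tau^2}{1+\tau^2}\,x, \\
r(\pi, \delta^g) &= 1, &
r(\pi, \delta^\pi) &= \frac{\tau^2}{1+\tau^2}, \\
r(\pi, \delta^g) - r(\pi, \delta^\pi) &= \frac{1}{1+\tau^2}, &
E\,|\delta^\pi(X) - \delta^g(X)|^2 &= \frac{E X^2}{(1+\tau^2)^2} = \frac{1}{1+\tau^2}.
\end{align*}
```

Both sides equal $1/(1+\tau^2)$, with the expectation on the right taken under the marginal distribution of $X$.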


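Theorem 7.19 Let $X \sim N_r(\theta, I)$ and $L(\theta, \delta) = |\theta - \delta|^2$. Let $\delta^g(x) = x +
\nabla \log m_g(x)$ where $m_g(x) = \int_\Omega f(x|\theta)g(\theta)\,d\theta$. Assume that $g(\cdot)$ satisfies

(a) $$\int_{\{\theta : |\theta| > 1\}} \frac{g(\theta)}{|\theta|^2 \max\{\log|\theta|, \log 2\}^2}\,d\theta < \infty,$$

(b) $$\int_\Omega \frac{|\nabla g(\theta)|^2}{g(\theta)}\,d\theta < \infty,$$

(c) $\sup\{R(\theta, \delta^g) : \theta \in K\} < \infty$ for all compact sets $K \subset \Omega$.

Then, $\delta^g(x)$ is admissible.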


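Proof. The proof follows from (7.14) by taking the sequence of priors $g_n(\theta) \to g(\theta)$,
where $g_n(\theta) = h_n^2(\theta)g(\theta)$ and

(7.16)    $$h_n(\theta) = \begin{cases} 1 & \text{if } |\theta| \le 1 \\ 1 - \dfrac{\log|\theta|}{\log n} & \text{if } 1 < |\theta| \le n \\ 0 & \text{if } |\theta| > n \end{cases}$$

for $n = 2, 3, \ldots$. See Problem 7.18 for details. □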
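The truncating sequence used in this proof is easy to tabulate. The Python sketch below is an illustration under stated assumptions: it takes the middle branch of $h_n$ to be $1 - \log|\theta| / \log n$ on $1 < |\theta| \le n$ (the logarithmic taper of Brown and Hwang), and the function names `h_n` and `g_n` are hypothetical helpers introduced here.

```python
import math


def h_n(theta_norm, n):
    """Truncation factor h_n evaluated at |theta| = theta_norm, for an integer n >= 2."""
    if theta_norm <= 1.0:
        return 1.0
    if theta_norm <= n:
        # Assumed middle branch: logarithmic taper from 1 at |theta| = 1 to 0 at |theta| = n.
        return 1.0 - math.log(theta_norm) / math.log(n)
    return 0.0


def g_n(theta_norm, n, g):
    """Truncated prior g_n = h_n^2 * g, with g supplied as a function of |theta|."""
    return h_n(theta_norm, n) ** 2 * g(theta_norm)
```

For fixed $|\theta|$, `h_n(theta_norm, n)` increases to 1 as $n$ grows, so `g_n` increases to `g`, while each `g_n` vanishes outside the ball of radius $n$.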
Example 7.20 Multivariate normal mean. The conditions of Theorem 7.19 relate
to the tails of the prior, which are crucial in determining whether the integrals
are finite. Priors with polynomial tails, that is, priors of the form $g(\theta) = 1/|\theta|^k$,
have received a lot of attention. Perhaps the reason for this is that using a Laplace
approximation (4.6.33), we can write

(7.17)    $$\delta^g(x) = x + \nabla \log m_g(x) = x + \frac{\nabla m_g(x)}{m_g(x)} \approx x + \frac{\nabla g(x)}{g(x)} = \left(1 - \frac{k}{|x|^2}\right)x.$$

See Problem 7.20 for details.

Now what can we say about the admissibility of $\delta^g$? For $g(\theta) = 1/|\theta|^k$, condition
(a) of Theorem 7.19 becomes, upon transforming to polar coordinates,

(7.18)    $$\int_{\{\theta : |\theta| > 1\}} \frac{g(\theta)}{|\theta|^2 \max\{\log|\theta|, \log 2\}^2}\,d\theta = \int_0^{2\pi} \sin^{r-2}\beta\,d\beta \int_1^\infty \frac{1}{t^{k+2}\max\{\log(t), \log 2\}^2}\,t^{r-1}\,dt,$$

where $t = |\theta|$ and $\beta$ is a vector of direction cosines. The integral over $\beta$ is finite,
and if we ignore the log term, a sufficient condition for this integral to be finite is

(7.19)    $$\int_1^\infty \frac{1}{t^{k-r+3}}\,dt < \infty,$$

which is satisfied if $k > r - 2$. If we keep the log term and work a little harder,
condition (a) can be verified for $k \ge r - 2$ (see Problem 7.22). ||


Example 7.21 Continuation of Example 7.20. The characterization of admissible
estimators by Brown (1971) goes beyond that of Theorem 7.19, as he was able
to establish both necessary and sufficient conditions. Here is an example of these
results.

Using a spherically symmetric prior (see, for example, Corollary 4.3.3), all
generalized Bayes estimators are of the form

(7.20)    $$\delta^\pi(x) = x + \nabla \log m(x) = (1 - h(|x|))x.$$

The estimator $\delta^\pi$ is

(a) inadmissible if there exist $\varepsilon > 0$ and $M < \infty$ such that

$$0 \le h(|x|) < \frac{r - 2 - \varepsilon}{|x|^2} \quad \text{for } |x| > M,$$

(b) admissible if $h(|x|)|x|$ is bounded and there exists $M < \infty$ such that

$$1 \ge h(|x|) \ge \frac{r-2}{|x|^2} \quad \text{for } |x| > M.$$

It is interesting how the factor $r - 2$ appears again and supports its choice as
the optimal constant in the James-Stein estimator (even though that estimator is
not generalized Bayes, and hence inadmissible). Bounds such as these are called
semitail upper bounds by Hwang (1982b), who further developed their applicability.

For Strawderman’s estimator (see Problem 5.5), we have

(7.21)    $$h(|x|) = \frac{r - 2a + 2}{|x|^2}\,\frac{P(\chi^2_{r-2a+4} \le |x|^2)}{P(\chi^2_{r-2a+2} \le |x|^2)}$$

and it is admissible (Problem 7.23) as long as $r - 2a + 2 \ge r - 2$, or $r \ge 1$ and
$a \le 2$. ||


Now that we have a reasonably complete picture of the types of estimators 
that are admissible estimators of a normal mean, it is interesting to see how the 
admissibility conditions fit in with minimaxity conditions. To do so requires the 
development of some general necessary conditions for minimaxity. This was first 
done by Berger (1976a), who derived conditions for an estimator to be tail minimax. 

Example 7.22 Tail minimaxity. Let $X \sim N_r(\theta, I)$ and $L(\theta, \delta) = |\theta - \delta|^2$. Since
the estimator $X$ is minimax with constant risk $R(\theta, X)$, another estimator $\delta$ is tail
minimax if there exists $M > 0$ such that $R(\theta, \delta) \le R(\theta, X)$ for all $|\theta| > M$.
(Berger investigated tail minimaxity for much more general situations than are
considered here, including non-normal distributions and nonquadratic loss.) Since
tail minimaxity is a necessary condition for minimaxity, it can help us see which
admissible estimators have the possibility of also being minimax. An interesting
characterization of $h(|x|)$ of (7.20) is obtained if admissibility is considered
together with tail minimaxity.

Using a risk representation similar to (5.4), the risk of $\delta(x) = [1 - h(|x|)]x$ is

(7.22)    $$R(\theta, \delta) = r + E_\theta\left[|X|^2h^2(|X|) - 2rh(|X|) - 4|X|^2h'(|X|)\right].$$

If we now use a Laplace approximation on the expectation in (7.22), we have

(7.23)    $$E_\theta\left[|X|^2h^2(|X|) - 2rh(|X|) - 4|X|^2h'(|X|)\right] \approx |\theta|^2h^2(|\theta|) - 2rh(|\theta|) - 4|\theta|^2h'(|\theta|) = B(\theta).$$

By carefully working with the error terms in the Laplace approximation, Berger
showed that the error of approximation was $o(|\theta|^{-2})$, that is,

(7.24)    $$R(\theta, \delta) = r + B(\theta) + o(|\theta|^{-2}).$$

In order to ensure that the estimator is tail minimax, we must be able to ensure
that $B(\theta) + o(|\theta|^{-2}) < 0$ for sufficiently large $|\theta|$. This would occur if, for some
$\varepsilon > 0$, $|\theta|^2 B(\theta) \le -\varepsilon$ for sufficiently large $|\theta|$, that is,

(7.25)    $$|\theta|^2h^2(|\theta|) - 2rh(|\theta|) - 4|\theta|^2h'(|\theta|) \le \frac{-\varepsilon}{|\theta|^2}$$

for sufficiently large $|\theta|$.

Now, for $\delta(x) = [1 - h(|x|)]x$ to be admissible, we must have $h(|x|) \ge (r - 2)/|x|^2$.
Since $|x|h(|x|)$ must be bounded, this suggests that, for large $|x|$, we could
have $h(|x|) \approx k/|x|^{2\alpha}$, for some $\alpha$, $1/2 \le \alpha \le 1$. We now show that for $\delta(x)$ to be
minimax, it is necessary that $\alpha = 1$.

For $h(|x|) = k/|x|^{2\alpha}$, (7.25) is equal to

(7.26)    $$\frac{k}{|\theta|^{2\alpha}}\left[\frac{k}{|\theta|^{2\alpha-2}} - 2(r - 2\alpha)\right] \le \frac{-\varepsilon}{|\theta|^2} \quad \text{for } |\theta| > M,$$

which, for $r \ge 3$, cannot be satisfied if $1/2 \le \alpha < 1$. Thus, the only possible
admissible minimax estimators are those for which $h(|x|) \sim k/|x|^2$, with $r - 2 \le
k \le 2(r - 2)$. ||
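The role of the exponent in the tail condition can be seen by evaluating the leading risk term for $h(t) = k/t^{2\alpha}$. The Python sketch below is illustrative only. It assumes that $h'$ denotes the derivative with respect to $t^2 = |\theta|^2$, which is the reading under which the bracketed expression $k/|\theta|^{2\alpha-2} - 2(r - 2\alpha)$ is obtained; the function names and the chosen radii are assumptions made here, and checking a few radii is a heuristic, not a proof.

```python
def tail_term(theta_norm, k, alpha, r):
    """Leading term B(theta) for h(t) = k / t^(2*alpha), with h' taken with respect to t^2."""
    t = theta_norm
    return (k / t ** (2 * alpha)) * (k / t ** (2 * alpha - 2) - 2 * (r - 2 * alpha))


def looks_tail_minimax(k, alpha, r, radii=(1e2, 1e3, 1e4)):
    """Heuristic check that |theta|^2 * B(theta) is negative at a few large radii."""
    return all(t ** 2 * tail_term(t, k, alpha, r) < 0 for t in radii)
```

With $r = 5$, the call `looks_tail_minimax(3, 1.0, 5)` succeeds because $|\theta|^2 B(\theta) = k(k - 2(r-2)) = -9$, whereas `looks_tail_minimax(3, 0.75, 5)` fails because the term $k/|\theta|^{2\alpha-2}$ grows with $|\theta|$ when $\alpha < 1$.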


Theorem 7.17 can be adapted to apply to discrete distributions (the assumption 
of a density can be replaced by a probability mass function), and an interesting 
case is the binomial distribution. It turns out that the fact that the sample space is 
finite has a strong influence on the form of the admissible estimators. We first look 
at the following characterization of admissible estimators, due to Johnson (1971). 

Example 7.23 Binomial estimation. For the problem of estimating $h(p)$, where
$h(\cdot)$ is a continuous real-valued function on $[0, 1]$, $X \sim \text{binomial}(p, n)$, and
$L(h(p), \delta) = (h(p) - \delta)^2$, a minimal complete class is given by

(7.27)    $$\delta^\pi(x) = \begin{cases} h(0) & \text{if } x \le r \\ \dfrac{\int_0^1 h(p)\,p^{x-r-1}(1-p)^{s-x-1}\,d\pi(p)}{\int_0^1 p^{x-r-1}(1-p)^{s-x-1}\,d\pi(p)} & \text{if } r + 1 \le x \le s - 1 \\ h(1) & \text{if } x \ge s, \end{cases}$$

where $p$ has the prior distribution

(7.28)    $$p \sim k(p)\,d\pi(p)$$

with

$$k(p) = \frac{P(r + 1 \le X \le s - 1 \mid p)}{p^{r+1}(1-p)^{n-s+1}},$$

$r$ and $s$ are integers, $-1 \le r < s \le n + 1$, and $\pi$ is a probability measure with
$\pi(\{0\} \cup \{1\}) < 1$, that is, $\pi$ does not put all of its mass on the endpoints of the
parameter space.

To see that $\delta^\pi$ is admissible, let $\delta'$ be another estimator of $h(p)$ that satisfies
$R(p, \delta^\pi) \ge R(p, \delta')$. We will assume that $s > 0$, $r < n$, and $r + 1 < s$ (as the
cases $r = -1$, $s = n + 1$, and $r + 1 = s$ are straightforward). Also, if $\delta'(x) = h(0)$ for
$x \le r'$, and $\delta'(x) = h(1)$ for $x \ge s'$, then it follows that $r' \ge r$ and $s' \le s$. Define

$$R_{r,s}(p, \delta) = \sum_{x=r+1}^{s-1} \binom{n}{x}[h(p) - \delta(x)]^2\,p^{x-r-1}(1-p)^{s-x-1}\,k(p)^{-1}.$$

Now, $R(p, \delta') \le R(p, \delta^\pi)$ for all $p \in [0, 1]$ if and only if $R_{r,s}(p, \delta') \le R_{r,s}(p, \delta^\pi)$
for all $p \in [0, 1]$. However, for the prior (7.28), $\int_0^1 R_{r,s}(p, \delta)\,k(p)\,d\pi(p)$
is uniquely minimized by $[\delta^\pi(r + 1), \ldots, \delta^\pi(s - 1)]$, which establishes
the admissibility of $\delta^\pi$. The converse assertion, that any admissible estimator
is of the form (7.27), follows from Theorem 7.17. (See Problem 7.27.)

For $h(p) = p$, we can take $r = 0$, $s = n$, and $\pi(p) = \text{Beta}(a, b)$. The resulting
estimators are of the form $\alpha\frac{x}{n} + (1 - \alpha)\frac{a}{a+b}$, so we obtain conditions on
admissibility of linear estimators. In particular, we see that $x/n$ is an admissible
estimator. If $h(p) = p(1 - p)$, we find an admissible estimator of the variance to
be $\frac{n}{n+1}(x/n)(1 - x/n)$. (See Problem 7.26.)

Brown (1981, 1988) has generalized Johnson’s results and characterizes a minimal
complete class for estimation in a wide class of problems with finite sample
spaces. ||
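An estimator of Johnson's form can be evaluated numerically when $\pi$ has a density on $(0, 1)$. The Python sketch below is an illustration under stated assumptions: it takes $\pi$ to be uniform by default, approximates both integrals by a midpoint rule, and uses hypothetical names (`johnson_estimate`, `variance_h`) and a grid size chosen here. With $h(p) = p(1 - p)$, $r = 0$, $s = n$, and uniform $\pi$, the ratio of beta integrals reduces to $\frac{n}{n+1}(x/n)(1 - x/n)$, which the code reproduces.

```python
def johnson_estimate(x, n, h, r=0, s=None, prior_density=lambda p: 1.0, n_grid=20000):
    """Estimate h(p) from x successes, using h(0) for x <= r, h(1) for x >= s,
    and a ratio of midpoint-rule integrals against p^(x-r-1) (1-p)^(s-x-1) otherwise."""
    if s is None:
        s = n  # illustrative default: upper cutoff equal to the sample size
    if x <= r:
        return h(0.0)
    if x >= s:
        return h(1.0)
    step = 1.0 / n_grid
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        p = (i + 0.5) * step
        w = p ** (x - r - 1) * (1.0 - p) ** (s - x - 1) * prior_density(p)
        num += h(p) * w
        den += w
    return num / den


def variance_h(p):
    """The binomial variance function h(p) = p (1 - p)."""
    return p * (1.0 - p)
```

For example, `johnson_estimate(3, 10, variance_h)` agrees with `(10 / 11) * 0.3 * 0.7` up to the quadrature error.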

Johnson (1971) was further able to establish the somewhat surprising result that
if $\delta_1$ is an admissible estimator of $h(p_1)$ in the binomial $b(p_1, n_1)$ problem and
$\delta_2$ is an admissible estimator of $h(p_2)$ in the binomial $b(p_2, n_2)$ problem, then
$[\delta_1, \delta_2]$ is an admissible estimator of $[h(p_1), h(p_2)]$ if the loss function is the sum
of the losses. This result can be extended to higher dimensions, and thus there is
no Stein effect in the binomial problem. The following example gives conditions
under which this can be expected.

Example 7.24 When there is no Stein effect. For $i = 1, 2$, let $X_i \sim f_i(x|\theta_i)$ and
suppose that $\delta_i^*(x_i)$ is a unique Bayes (hence, admissible) estimator of $\theta_i$ under the
loss $L(\theta_i, \delta)$, where $L$ satisfies $L(a, a) = 0$ and $L(a, a') > 0$, $a \ne a'$, and all risk
functions are continuous. Suppose there is a value $\theta^*$ such that if $\theta_2 = \theta^*$,

(i) $X_2 = x^*$ with probability 1,

(ii) $\delta_2^*(x^*) = \theta^*$,

then $(\delta_1^*(x_1), \delta_2^*(x_2))$ is admissible for $(\theta_1, \theta_2)$ under the loss $\sum_i L(\theta_i, \delta_i)$; that is,
there is no Stein effect.

To see why this is so, let $\delta' = (\delta_1'(X_1, X_2), \delta_2'(X_1, X_2))$ be a competitor. At the
parameter value $(\theta_1, \theta_2) = (\theta_1, \theta^*)$, we have

(7.29)    $$\begin{aligned} R[(\theta_1, \theta^*), \delta'] &= E_{(\theta_1, \theta^*)}L[\theta_1, \delta_1'(X_1, X_2)] + E_{(\theta_1, \theta^*)}L[\theta_2, \delta_2'(X_1, X_2)] \\ &= E_{(\theta_1, \theta^*)}L[\theta_1, \delta_1'(X_1, x^*)] + E_{(\theta_1, \theta^*)}L[\theta^*, \delta_2'(X_1, x^*)] \\ &= E_{\theta_1}L[\theta_1, \delta_1'(X_1, x^*)] + E_{\theta_1}L[\theta^*, \delta_2'(X_1, x^*)], \end{aligned}$$

while for $(\delta_1^*(x_1), \delta_2^*(x_2))$,

(7.30)    $$\begin{aligned} R[(\theta_1, \theta^*), \delta^*] &= E_{(\theta_1, \theta^*)}L[\theta_1, \delta_1^*(X_1)] + E_{(\theta_1, \theta^*)}L[\theta_2, \delta_2^*(X_2)] \\ &= E_{\theta_1}L[\theta_1, \delta_1^*(X_1)] + E_{\theta^*}L[\theta^*, \delta_2^*(x^*)] \\ &= E_{\theta_1}L[\theta_1, \delta_1^*(X_1)] \end{aligned}$$

as $E_{\theta^*}L[\theta^*, \delta_2^*(x^*)] = 0$.

Since $\delta_1^*$ is a unique Bayes estimator of $\theta_1$,

$$E_{\theta_1}L[\theta_1, \delta_1^*(X_1)] < E_{\theta_1}L[\theta_1, \delta_1'(X_1, x^*)] \quad \text{for some } \theta_1.$$

Since $E_{\theta_1}L[\theta^*, \delta_2'(X_1, x^*)] \ge 0$, it follows that $R[(\theta_1, \theta^*), \delta^*] < R[(\theta_1, \theta^*), \delta']$
for some $\theta_1$, and hence that $\delta^*$ is an admissible estimator of $(\theta_1, \theta_2)$. By induction,
the result can be extended to any number of coordinates (see Problem 7.28).

If $X \sim b(\theta, n)$, then we can take $\theta^* = 0$ or 1, and the above result applies. The
absence of the Stein effect persists in other situations, such as any problem with
a finite sample space (Gutmann 1982a; see also Brown 1981). Gutmann (1982b)
also demonstrates a sequential context in which the Stein effect does not hold (see
Problem 7.29). ||

Finally, we look at the admissibility of linear estimators. There has always been
interest in characterizing admissibility of linear estimators, partly due to the ease
of computing and using linear estimators, and also due to a search for a converse
to Karlin’s theorem (Theorem 2.14) (which gives sufficient conditions for
admissibility of linear estimators). Note that we are concerned with the admissibility
of linear estimators in the class of all estimators, not just in the class of linear
estimators. (This latter question was addressed by La Motte (1982).)

Example 7.25 Admissible linear estimators. Let $X \sim N_r(\theta, I)$, and consider
estimation of $\varphi'\theta$, where $\varphi_{r \times 1}$ is a known vector, and $L(\varphi'\theta, \delta) = (\varphi'\theta - \delta)^2$. For
$r = 1$, the results of Karlin (1958) (see also Meeden and Ghosh 1977) show that
$ax$ is admissible if and only if $0 \le a \le \varphi$. This result was generalized by Cohen
(1965a) to show that $a'x$ is admissible if and only if $a$ is in the sphere:

(7.31)    $$\{a : (a - \varphi/2)'(a - \varphi/2) \le \varphi'\varphi/4\}$$

(see Problem 7.30). Note that the extension to known covariance matrix is
straightforward, and (7.31) becomes an ellipse.

For the problem of estimating $\theta$, the linear estimator $Cx$, where $C$ is an $r \times r$
symmetric matrix, is admissible if and only if all of the eigenvalues of $C$ are
between 0 and 1, with at most two equal to 1 (Cohen 1966).

Necessary and sufficient conditions for admissibility of linear estimators have
also been described for multivariate Poisson estimation (Brown and Farrell, 1985a,
1985b) and for estimation of the scale parameters in the multivariate gamma
distribution (Farrell et al., 1989). This latter result also has application to the estimation
of variance components in mixed models. ||
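Cohen's criterion for a linear estimator $a'x$ of $\varphi'\theta$ amounts to a membership test for a closed ball centred at $\varphi/2$ with radius $|\varphi|/2$. The Python sketch below illustrates that test for the identity-covariance case only; the function name and the numerical tolerance are assumptions introduced here.

```python
def linear_estimator_admissible(a, phi, tol=1e-12):
    """Return True if (a - phi/2)'(a - phi/2) <= phi'phi / 4, up to a small tolerance."""
    if len(a) != len(phi):
        raise ValueError("a and phi must have the same length")
    dist_sq = sum((ai - pi / 2.0) ** 2 for ai, pi in zip(a, phi))
    radius_sq = sum(pi * pi for pi in phi) / 4.0
    return dist_sq <= radius_sq + tol
```

For $r = 1$ and $\varphi = 2$ the test accepts exactly the coefficients $0 \le a \le 2$, in agreement with the one-dimensional result quoted from Karlin.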


8 Problems 
Section 1 

1.1 For the situation of Example 1.2:

(a) Plot the risk functions of $\delta^{1/4}$, $\delta^{1/2}$, and $\delta^{3/4}$ for $n = 5, 10, 25$.

(b) For each value of $n$ in part (a), find the range of prior values of $p$ for which each
estimator is preferred.

(c) If an experimenter has no prior knowledge of $p$, which of $\delta^{1/4}$, $\delta^{1/2}$, and $\delta^{3/4}$ would
you recommend? Justify your choice.

1.2 The principle of gamma-minimaxity [first used by Hodges and Lehmann (1952); see
also Robbins 1964 and Solomon 1972a, 1972b] is a Bayes/frequentist synthesis. An
estimator $\delta^*$ is gamma-minimax if

$$\inf_{\delta} \sup_{\pi \in \Gamma} r(\pi, \delta) = \sup_{\pi \in \Gamma} r(\pi, \delta^*)$$

where $\Gamma$ is a specified class of priors. Thus, the estimator $\delta^*$ minimizes the maximum
Bayes risk over those priors in the class $\Gamma$. (If $\Gamma$ = all priors, then $\delta^*$ would be minimax.)

(a) Show that if $\Gamma = \{\pi_0\}$, that is, $\Gamma$ consists of one prior, then the Bayes estimator is
$\Gamma$ minimax.

(b) Show that if $\Gamma$ = {all priors}, then the minimax estimator is $\Gamma$ minimax.

(c) Find the $\Gamma$-minimax estimator among the three estimators of Example 1.2.

1.3 Classes of priors for $\Gamma$-minimax estimation have often been specified using moment
restrictions.

(a) For $X \sim b(p, n)$, find the $\Gamma$-minimax estimator of $p$ under squared error loss, with

$$\Gamma_\mu = \left\{\pi(p) : \pi(p) = \text{beta}(a, b),\ \mu = \frac{a}{a+b}\right\}$$

where $\mu$ is considered fixed and known.

(b) For $X \sim N(\theta, 1)$, find the $\Gamma$-minimax estimator of $\theta$ under squared error loss, with

$$\Gamma_{\mu,\tau} = \{\pi(\theta) : E(\theta) = \mu,\ \text{var}\,\theta = \tau^2\}$$

where $\mu$ and $\tau$ are fixed and known.

[Hint: In part (b), show that the $\Gamma$-minimax estimator is the Bayes estimator against a
normal prior with the specified moments (Jackson et al. 1970; see Chen,
Eichenhauer-Herrmann, and Lehn 1990 for a multivariate version). This somewhat nonrobust
$\Gamma$-minimax estimator is characteristic of estimators derived from moment restrictions and
shows why robust Bayesians tend to not use such classes. See Berger 1985, Section 4.7.6
for further discussion.]

1.4 (a) For the random effects model of Example 4.2.7 (see also Example 3.5.1), show
that the restricted maximum likelihood (REML) likelihood of $\sigma_A^2$ and $\sigma^2$ is given
by (4.2.13), which can be obtained by integrating the original likelihood against a
uniform $(-\infty, \infty)$ prior for $\mu$.

(b) For $n_i = n$ in

$$X_{ij} = \mu + A_i + U_{ij} \quad (j = 1, \ldots, n_i,\ i = 1, \ldots, s)$$

calculate the expected value of the REML estimate of $\sigma_A^2$ and show that it is biased.
Compare REML to the unbiased estimator of $\sigma_A^2$. Which do you prefer?

(Construction of REML-type marginal likelihoods, where some effects are integrated out
against priors, becomes particularly useful in nonlinear and generalized linear models.
See, for example, Searle et al. 1992, Section 9.4 and Chapter 10.)

1.5 Establishing the fact that (9.1) holds, so $S^2$ is conditionally biased, is based on a
number of steps, some of which can be involved. Define
$\phi(a, \mu, \sigma^2) = (1/\sigma^2) E_{\mu,\sigma}[S^2 \mid |\bar{x}|/s < a]$.

(a) Show that $\phi(a, \mu, \sigma^2)$ only depends on $\mu$ and $\sigma^2$ through $\mu/\sigma$. Hence, without loss
of generality, we can assume $\sigma = 1$.

(b) Use the fact that the density $f(s \mid |\bar{x}|/s < a, \mu)$ has monotone likelihood ratio to
establish $\phi(a, \mu, 1) \ge \phi(a, 0, 1)$.

(c) Show that

$$\lim_{a \to \infty} \phi(a, 0, 1) = 1 \quad \text{and} \quad \lim_{a \to 0} \phi(a, 0, 1) = \frac{E_0 S^3}{E_0 S} = \frac{n}{n - 1}.$$

(d) Combine parts (a), (b), and (c) to establish (9.1).
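
The small-$a$ limit in part (c) of Problem 1.5 can be sanity-checked numerically. The sketch below is not part of the text; it assumes that $S^2$ is the usual sample variance, so that $(n-1)S^2 \sim \chi^2_{n-1}$ when $\mu = 0$ and $\sigma = 1$, and it uses the standard moment formula for the chi distribution. The function names are illustrative only.

```python
import math


def chi_moment(k: float, nu: int) -> float:
    """E[W^k] for W distributed as chi (not chi-squared) with nu degrees of freedom."""
    return 2 ** (k / 2) * math.gamma((nu + k) / 2) / math.gamma(nu / 2)


def small_a_limit(n: int) -> float:
    """E_0[S^3] / E_0[S] under mu = 0, sigma = 1, assuming (n - 1) S^2 ~ chi^2_{n-1}."""
    nu = n - 1
    # S = W / sqrt(nu), so the k-th moment of S carries a factor nu^(-k/2).
    third = chi_moment(3, nu) / nu ** 1.5
    first = chi_moment(1, nu) / nu ** 0.5
    return third / first


for n in (2, 5, 10, 50):
    # The two printed numbers should agree up to rounding error.
    print(n, small_a_limit(n), n / (n - 1))
```

Under the stated assumption the ratio matches $n/(n-1)$, which exceeds 1, consistent with the conditional bias that the problem asks to establish.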

The next three problems explore conditional properties of estimators. A detailed
development of this theory is found in Robinson (1979a, 1979b), who also explored the
relationship between admissibility and conditional properties.

1.6 Suppose that $X \sim f(x|\theta)$, and $T(x)$ is used to estimate $\tau(\theta)$. One might question
the worth of $T(x)$ if there were some set $A \in \mathcal{X}$ for which $T(x) > \tau(\theta)$ for $x \in A$ (or
if the reverse inequality holds). This leads to the conditional principle of never using an
estimator if there exists a set $A \in \mathcal{X}$ for which $E_\theta\{[T(X) - \tau(\theta)] I(X \in A)\} \ge 0\ \forall \theta$,
with strict inequality for some $\theta$ (or if the equivalent statement holds with the inequality
reversed). Show that if $T(x)$ is the posterior mean of $\tau(\theta)$ against a proper prior, where
both the prior and $f(x|\theta)$ are continuous in $\theta$, then no such $A$ can exist. (If such an $A$
exists, it is called a semirelevant set. Elimination of semirelevant sets is an extremely
strong requirement. A weaker requirement, elimination of relevant sets, seems more
appropriate.)

1.7 Show that if there exists a set $A \in \mathcal{X}$ and an $\varepsilon > 0$ for which
$E_\theta\{[T(X) - \tau(\theta)] I(X \in A)\} > \varepsilon$, then $T(x)$ is inadmissible for estimating $\tau(\theta)$ under squared error loss. (A set
$A$ satisfying this inequality is an example of a relevant set.)

[Hint: Consider the estimator $T(x) - \varepsilon I(x \in A)$.]

1.8 To see why elimination of semirelevant sets is too strong a requirement, consider the
estimation of $\theta$ based on observing $X \sim f(x - \theta)$. Show that for any constant $a$, the
Pitman estimator $X$ satisfies

$$E_\theta[(X - \theta) I(X < a)] \le 0\ \forall \theta \quad \text{or} \quad E_\theta[(X - \theta) I(X > a)] \ge 0\ \forall \theta,$$

with strict inequality for some $\theta$. Thus, there are semirelevant sets for the Pitman
estimator, which is, by most accounts, a fine estimator.

1.9 In Example 1.7, let $\delta^*(X) = X/n$ with probability $1 - \varepsilon$ and $= 1/2$ with probability
$\varepsilon$. Determine the risk function of $\delta^*$ and show that for $\varepsilon = 1/(n + 1)$, its risk is constant
and less than $\sup R(p, X/n)$.
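
The constancy claimed in Problem 1.9 is easy to check with exact arithmetic. The following is a minimal sketch, assuming squared error loss and $X \sim b(p, n)$, so that the risk of $X/n$ is $p(1-p)/n$ and the risk of the constant $1/2$ is $(p - 1/2)^2$; the helper name `risk_mixture` is hypothetical.

```python
from fractions import Fraction


def risk_mixture(p: Fraction, n: int, eps: Fraction) -> Fraction:
    """Risk of the rule that uses X/n with probability 1 - eps and 1/2 with probability eps."""
    risk_mle = p * (1 - p) / n                # variance of X/n, which is unbiased
    risk_half = (p - Fraction(1, 2)) ** 2     # squared bias of the constant 1/2
    return (1 - eps) * risk_mle + eps * risk_half


n = 10
eps = Fraction(1, n + 1)
# Collect the distinct risk values over a grid of p; a constant risk gives one value.
risks = {risk_mixture(Fraction(k, 20), n, eps) for k in range(21)}
sup_risk_mle = Fraction(1, 4 * n)             # p(1 - p)/n is largest at p = 1/2
print(risks, sup_risk_mle)
```

A grid check is not a proof, but it shows the single value $1/[4(n+1)]$ that the algebra should produce, which lies below $1/(4n)$.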

1.10 Find the bias of the minimax estimator (1.11) and discuss its direction. 

1.11 In Example 1.7,

(a) determine $c_n$ and show that $c_n \to 0$ as $n \to \infty$,

(b) show that $R_n(1/2)/r_n \to 1$ as $n \to \infty$.

1.12 In Example 1.7, graph the risk functions of X/n and the minimax estimator (1.11) 
for n = 1, 4, 9, 16, and indicate the relative positions of the two graphs for large values 
of n. 

1.13 (a) Find two points $0 < p_0 < p_1 < 1$ such that the estimator (1.11) for $n = 1$ is
Bayes with respect to a distribution $\Lambda$ for which $P_\Lambda(p = p_0) + P_\Lambda(p = p_1) = 1$.

(b) For $n = 1$, show that (1.11) is a minimax estimator of $p$ even if it is known that
$p_0 \le p \le p_1$.

(c) In (b), find the values $p_0$ and $p_1$ for which $p_1 - p_0$ is as small as possible.

1.14 Evaluate (1.16) and show that its maximum is 1 — a. 

1.15 Let $X = 1$ or $0$ with probabilities $p$ and $q$, respectively, and consider the estimation
of $p$ with loss $= 1$ when $|d - p| > 1/4$, and 0 otherwise. The most general randomized
estimator is $\delta = U$ when $X = 0$, and $\delta = V$ when $X = 1$, where $U$ and $V$ are two random
variables with known distributions.

(a) Evaluate the risk function and the maximum risk of $\delta$ when $U$ and $V$ are uniform
on $(0, 1/2)$ and $(1/2, 1)$, respectively.

(b) Show that the estimator $\delta$ of (a) is minimax by considering the three values $p = 0$,
$1/2$, $1$.

[Hint: (b) The risk at $p = 0, 1/2, 1$ is, respectively, $P(U > 1/4)$,
$\frac{1}{2}[P(U < 1/4) + P(V > 3/4)]$, and $P(V < 3/4)$.]
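
For part (a) of Problem 1.15, the risk function can be tabulated directly. The sketch below is an illustration rather than the book's solution: it computes $R(p, \delta) = q\,P(|U - p| \ge 1/4) + p\,P(|V - p| \ge 1/4)$ in closed form from interval overlaps, with `prob_far` and `risk` as hypothetical helper names.

```python
def prob_far(lo: float, hi: float, p: float, tol: float = 0.25) -> float:
    """P(|W - p| >= tol) for W uniform on (lo, hi)."""
    # Length of the part of (lo, hi) that lies within tol of p.
    near = max(0.0, min(hi, p + tol) - max(lo, p - tol))
    return 1.0 - near / (hi - lo)


def risk(p: float) -> float:
    """Risk of delta = U when X = 0 and V when X = 1, with U ~ U(0, 1/2), V ~ U(1/2, 1)."""
    q = 1.0 - p
    return q * prob_far(0.0, 0.5, p) + p * prob_far(0.5, 1.0, p)


grid = [k / 1000 for k in range(1001)]
max_risk = max(risk(p) for p in grid)
print(risk(0.0), risk(0.5), risk(1.0), max_risk)
```

On this grid the largest risk occurs at $p = 0$, $1/2$, and $1$, the three values singled out in part (b). Because $U$ and $V$ are continuous, it does not matter here whether the loss uses $>$ or $\ge$.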

1.16 Show that the problem of Example 1.8 remains invariant under the transformations
$X' = n - X$, $p' = 1 - p$, $d' = 1 - d$.


This illustrates that randomized equivariant estimators may have to be considered when 
G is not transitive. 

1.17 Let $r_\Lambda$ be given by (1.3). If $r_\Lambda = \infty$ for some $\Lambda$, show that any estimator $\delta$ has
unbounded risk.

1.18 In Example 1.9, show that no linear estimator has constant risk. 

1.19 Show that the risk function of (1.22) depends on $p_1$ and $p_2$ only through $p_1 + p_2$
and takes on its maximum when $p_1 + p_2 = 1$.

1.20 (a) In Example 1.9, determine the region in the $(p_1, p_2)$ unit square in which (1.22)
is better than the UMVU estimator of $p_2 - p_1$ for $m = n = 2$, $8$, $18$, and $32$.

(b) Extend Problems 1.11 and 1.12 to Example 1.9. 

1.21 In Example 1.14, show that $\bar{X}$ is minimax for the loss function $(d - \theta)^2/\sigma^2$ without
any restrictions on $\sigma$.

1.22 (a) Verify (1.37).

(b) Show that equality holds in (1.39) if and only if $P(Y_i = 0) + P(Y_i = 1) = 1$.

1.23 In Example 1.16(b), show that for any $k > 0$, the estimator

$$\delta = \frac{\sqrt{n}}{1 + \sqrt{n}} \cdot \frac{1}{n} \sum_{i=1}^{n} X_i^k + \frac{1}{2(1 + \sqrt{n})}$$

is a Bayes estimator for the prior distribution $\Lambda$ over $\mathcal{F}_0$ for which (1.36) was shown to
be Bayes.

1.24 Let $X_i$ ($i = 1, \ldots, m$) and $Y_j$ ($j = 1, \ldots, n$) be independent with distributions $F$
and $G$, respectively. If $F(1) - F(0) = G(1) - G(0) = 1$ but $F$ and $G$ are otherwise
unknown, find a minimax estimator for $E(Y_j) - E(X_i)$ under squared error loss.

1.25 Let $X_i$ ($i = 1, \ldots, n$) be iid with unknown distribution $F$. Show that

$$\delta = \frac{\text{No. of } X_i \le 0}{\sqrt{n}} \cdot \frac{1}{1 + \sqrt{n}} + \frac{1}{2(1 + \sqrt{n})}$$

is minimax for estimating $F(0) = P(X_i \le 0)$ with squared error loss. [Hint: Consider
the risk function of $\delta$.]

1.26 Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independently distributed as $N(\xi, \sigma^2)$ and $N(\eta, \tau^2)$,
respectively, and consider the problem of estimating $\Delta = \eta - \xi$ with squared error loss.

(a) If $\sigma$ and $\tau$ are known, $\bar{Y} - \bar{X}$ is minimax.

(b) If $\sigma$ and $\tau$ are restricted by $\sigma^2 \le A$ and $\tau^2 \le B$, respectively ($A$, $B$ known and
finite), $\bar{Y} - \bar{X}$ continues to be minimax.

1.27 In the linear model (3.4.4), show that $\sum a_i \hat{\xi}_i$ (in the notation of Theorem 3.4.4) is
minimax for estimating $\theta = \sum a_i \xi_i$ with squared error loss, under the restriction $\sigma^2 \le M$.

[Hint: Treat the problem in its canonical form.]

1.28 For the random variable X whose distribution is (1.42), show that x must satisfy the 
inequalities stated below (1.42). 

1.29 Show that the estimator defined by (1.43)

(a) has constant risk, 

(b) is Bayes with respect to the prior distribution specified by (1.44) and (1.45). 

1.30 Show that for fixed $X$ and $n$, (1.43) $\to$ (1.11) as $N \to \infty$.

1.31 Show that var(F) given by (3.7.6) takes on its maximum value subject to (1.41) 
when all the a ’s are 0 or 1. 

1.32 (a) If $R(p, \delta)$ is given by (1.49), show that $\sup R(p, \delta) \cdot 4(1 + \sqrt{n})^2 \to 1$ as $n \to \infty$.

(b) Determine the smallest value of $n$ for which the Bayes estimator of Example 1.18
satisfies (1.48) for $r = 1$ and $b = 5$, $10$, and $20$.

1.33 (Efron and Morris 1971) 

(a) Show that the estimator $\delta$ of (1.50) is the estimator that minimizes $|\delta - c\bar{x}|$ subject
to the constraint $|\delta - \bar{x}| \le M$. In this sense, it is the estimator that is closest to a
Bayes estimator, $c\bar{x}$, while not straying too far from a minimax estimator, $\bar{x}$.

(b) Show that for the situation of Example 1.19, $R(\theta, \delta)$ is bounded for $\delta$ of (1.50).

(c) For the situation of Example 1.19, $\delta$ of (1.50) satisfies $\sup_\theta R(\theta, \delta) = (1/n) + M^2$.


Section 2 

2.1 Lemma 2.1 has been extended by Berger (1990a) to include the case where the 
estimand need not be restricted to a finite interval, but, instead, attains a maximum or 
minimum at a finite parameter value. 

Lemma 8.1 Let the estimand $g(\theta)$ be nonconstant with global maximum or minimum
at a point $\theta^* \in \Omega$, for which $f(x|\theta^*) > 0$ a.e. (with respect to a dominating measure
$\mu$), and let the loss $L(\theta, d)$ satisfy the assumptions of Lemma 2.1. Then, any estimator
$\delta$ taking values above the maximum of $g(\theta)$, or below the minimum, is inadmissible.

(a) Show that if $\theta^*$ minimizes $g(\theta)$, and if $\hat{g}(x)$ is an unbiased estimator of $g(\theta)$, then
there exists $\varepsilon > 0$ such that the set $A_\varepsilon = \{x \in \mathcal{X} : \hat{g}(x) < g(\theta^*) - \varepsilon\}$ satisfies
$P(A_\varepsilon) > 0$. A similar conclusion holds if $g(\theta^*)$ is a maximum.

(b) Suppose $g(\theta^*)$ is a minimum. (The case of a maximum is handled similarly.) Show
that the estimator

$$\delta(x) = \begin{cases} \hat{g}(x) & \text{if } \hat{g}(x) \ge g(\theta^*) \\ g(\theta^*) & \text{if } \hat{g}(x) < g(\theta^*) \end{cases}$$

satisfies $R(\delta, \theta) - R(\hat{g}(x), \theta) < 0$.

(c) For the situation of Example 2.3, apply Lemma 8.1 to establish the inadmissibility
of the UMVU estimator of $\sigma_A^2$. Also, explain why the hypotheses of Lemma 8.1 are
not satisfied for the estimation of $\sigma^2$.

2.2 Determine the Bayes risk of the estimator (2.4) when $\theta$ has the prior distribution
$N(\mu, \tau^2)$.

2.3 Prove part (d) in the second proof of Example 2.8, that there exists a sequence of
values $\theta_i \to -\infty$ with $b'(\theta_i) \to 0$.

2.4 Show that an estimator $aX + b$ ($0 < a < 1$) of $E_\theta(X)$ is inadmissible (with squared
error loss) under each of the following conditions:

(a) if $E_\theta(X) > 0$ for all $\theta$, and $b < 0$;

(b) if $E_\theta(X) < k$ for all $\theta$, and $ak + b > k$.

[Hint: In (b), replace $X$ by $X' = k - X$ and $aX + b$ by $k - (aX + b) = aX' + k - b - ak$,
respectively, and use (a).]

2.5 Show that an estimator $[1/(1 + \lambda) + \varepsilon]X$ of $E_\theta(X)$ is inadmissible (with squared error
loss) under each of the following conditions:

(a) if $\operatorname{var}_\theta(X)/E_\theta^2(X) > \lambda > 0$ and $\varepsilon > 0$,

(b) if $\operatorname{var}_\theta(X)/E_\theta^2(X) < \lambda$ and $\varepsilon < 0$.

[Hint: (a) Differentiate the risk function of the estimator with respect to $\varepsilon$ to show that
it decreases as $\varepsilon$ decreases (Karlin 1958).]

2.6 Show that if $\operatorname{var}_\theta(X)/E_\theta^2(X) > \lambda > 0$, an estimator $[1/(1 + \lambda) + \varepsilon]X + b$ is inadmissible
(with squared error loss) under each of the following conditions:

(a) if $E_\theta(X) > 0$ for all $\theta$, $b > 0$ and $\varepsilon > 0$;

(b) if $E_\theta(X) < 0$ for all $\theta$, $b < 0$ and $\varepsilon > 0$ (Gupta 1966).

2.7 Brown (1986a) points out a connection between the information inequality and the 
unbiased estimator of the risk of Stein-type estimators. 


(a) Show that (2.7) implies

$$R(\theta, \delta) \ge \frac{[1 + b'(\theta)]^2}{n} + b^2(\theta) \ge \frac{1}{n} + \frac{2b'(\theta)}{n} + b^2(\theta)$$

and, hence, if $R(\theta, \delta) \le R(\theta, \bar{X})$, then $\frac{2b'(\theta)}{n} + b^2(\theta) \le 0$.

(b) Show that a nontrivial solution $b(\theta)$ would lead to an improved estimator $x - g(x)$,
for $p = 1$, in Corollary 4.7.2.


2.8 A density function $f(x|\theta)$ is variation reducing of order $n + 1$ ($VR_{n+1}$) if, for any
function $g(x)$ with $k$ ($k \le n$) sign changes (ignoring zeros), the expectation
$E_\theta g(X) = \int g(x) f(x|\theta)\,dx$ has at most $k$ sign changes. If $E_\theta g(X)$ has exactly $k$ sign changes, they
are in the same order.

Show that $f(x|\theta)$ is $VR_2$ if and only if it has monotone likelihood ratio. (See TSH2,
Lemma 2, Section 3.3 for the “if” implication.)

Brown et al. (1981) provide a thorough introduction to this topic, including $VR$
characterizations of many families of distributions (the exponential family is $VR_\infty$, as is the $\chi^2_\nu$
with $\nu$ the parameter, and the noncentral $\chi^2_\nu(\lambda)$ in $\lambda$). There is an equivalence between
$VR_n$ and $TP_n$, Karlin’s (1968) total positivity of order $n$, in that $VR_n = TP_n$.

2.9 For the situation of Example 2.9, show that:

(a) without loss of generality, the restriction $\theta \in [a, b]$ can be reduced to $\theta \in [-m, m]$,
$m > 0$.

(b) If $\Lambda$ is the prior distribution that puts mass $1/2$ on each of the points $\pm m$, then the
Bayes estimator against squared error loss is

$$\delta^\Lambda(\bar{x}) = m\, \frac{e^{mn\bar{x}} - e^{-mn\bar{x}}}{e^{mn\bar{x}} + e^{-mn\bar{x}}} = m \tanh(mn\bar{x}).$$

(c) For $m < 1/\sqrt{n}$,

$$\max_{\theta \in [-m, m]} R(\theta, \delta^\Lambda(\bar{X})) = \max\{R(-m, \delta^\Lambda(\bar{X})),\ R(m, \delta^\Lambda(\bar{X}))\}$$

and hence, by Corollary 1.6, $\delta^\Lambda$ is minimax.

[Hint: Problem 2.8 can be used to show that the derivative of the risk function can
have at most one sign change, from negative to positive, and hence any interior
extremum can only be a minimum.]

(d) For $m > 1.05/\sqrt{n}$, $\delta^\Lambda$ of part (b) is no longer minimax. Explain why this is so and
suggest an alternate estimator in this case.

[Hint: Consider $R(0, \delta^\Lambda)$.]
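
The closed form in part (b) of Problem 2.9 can be checked against a direct posterior-mean computation. This is a minimal sketch, assuming $\bar{X} \sim N(\theta, 1/n)$ and a prior with mass $1/2$ at each of $\pm m$; the function name `bayes_two_point` is illustrative.

```python
import math


def bayes_two_point(xbar: float, m: float, n: int) -> float:
    """Posterior mean of theta under prior mass 1/2 at +m and -m, with Xbar ~ N(theta, 1/n)."""
    # Likelihood of xbar at theta = +m and theta = -m; common constants cancel.
    w_plus = math.exp(-0.5 * n * (xbar - m) ** 2)
    w_minus = math.exp(-0.5 * n * (xbar + m) ** 2)
    return m * (w_plus - w_minus) / (w_plus + w_minus)


m, n = 0.5, 4
for xbar in (-0.3, 0.0, 0.2, 0.7):
    # The two columns should agree up to rounding error.
    print(xbar, bayes_two_point(xbar, m, n), m * math.tanh(m * n * xbar))
```

Expanding the squares in the two weights shows why the agreement is exact: the terms in $\bar{x}^2$ and $m^2$ cancel, leaving $e^{\pm mn\bar{x}}$.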

2.10 For the situation of Example 2.10, show that:

(a) $\max_{\theta \in [-m, m]} R(\theta, a\bar{X} + b) = \max\{R(-m, a\bar{X} + b),\ R(m, a\bar{X} + b)\}$.

(b) The estimator $a^*\bar{X}$, with $a^* = m^2/(\frac{1}{n} + m^2)$, is the linear minimax estimator for all
$m$ with minimax risk $a^*/n$.

(c) $\bar{X}$ is the linear minimax estimator for $m = \infty$.
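
Part (b) of Problem 2.10 can be explored numerically. The sketch below assumes $\bar{X} \sim N(\theta, 1/n)$ and, for simplicity, restricts attention to estimators $a\bar{X}$ with no intercept; it compares the worst-case risk at $a^* = m^2/(1/n + m^2)$ with a grid search over $a$. The names `worst_case_risk` and `a_star` are hypothetical.

```python
def worst_case_risk(a: float, m: float, n: int) -> float:
    """Maximum over |theta| <= m of the risk of a * Xbar when Xbar ~ N(theta, 1/n)."""
    # Risk at theta is a^2/n + (1 - a)^2 theta^2, which is largest at theta = +/- m.
    return a * a / n + (1 - a) ** 2 * m * m


def a_star(m: float, n: int) -> float:
    """Candidate minimax multiplier m^2 / (1/n + m^2)."""
    return m * m / (1 / n + m * m)


m, n = 0.8, 5
best = a_star(m, n)
grid_min = min(worst_case_risk(k / 1000, m, n) for k in range(1001))
print(best, worst_case_risk(best, m, n), best / n, grid_min)
```

The worst-case risk at $a^*$ coincides with $a^*/n$, and no grid value of $a$ does better, in line with the statement to be proved.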

2.11 Suppose $X$ has distribution $F_\xi$ and $Y$ has distribution $G_\eta$, where $\xi$ and $\eta$ vary
independently. If it is known that $\eta = \eta_0$, then any estimator $\delta(X, Y)$ can be improved
upon by

$$\delta^*(x) = E_Y \delta(x, Y) = \int \delta(x, y)\, dG_{\eta_0}(y).$$

[Hint: Recall the proof of Theorem 1.6.1.]

2.12 In Example 2.13, prove that the estimator $aY + b$ is inadmissible when $a > 1/(r + 1)$.
[Hint: Problems 2.4-2.6.]

2.13 Let $X_1, \ldots, X_n$ be iid according to a $N(0, \sigma^2)$ density, and let $S^2 = \sum X_i^2$. We
are interested in estimating $\sigma^2$ under squared error loss using linear estimators $cS^2 + d$,
where $c$ and $d$ are constants. Show that:

(a) admissibility of the estimator $aY + b$ in Example 2.13 is equivalent to the
admissibility of $cS^2 + d$, for appropriately chosen $c$ and $d$.

(b) the risk of $cS^2 + d$ is given by $R(cS^2 + d, \sigma^2) = 2nc^2\sigma^4 + [(nc - 1)\sigma^2 + d]^2$.

(c) for $d = 0$, $R(cS^2, \sigma^2) < R(0, \sigma^2)$ when $c < 2/(n + 2)$, and hence the estimator
$aY + b$ in Example 2.13 is inadmissible when $a = b = 0$.

[This exercise illustrates the fact that constants are not necessarily admissible estimators.]
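
The threshold in part (c) of Problem 2.13 can be checked with exact arithmetic. The sketch below assumes $S^2 = \sum X_i^2$, so that $E(S^2) = n\sigma^2$ and $\operatorname{var}(S^2) = 2n\sigma^4$; the function name `risk` and the chosen values of $n$ and $\sigma^2$ are illustrative.

```python
from fractions import Fraction


def risk(c, d, n, sigma2):
    """Risk of c*S^2 + d for sigma^2, assuming S^2 = sum of X_i^2 with X_i ~ N(0, sigma^2)."""
    variance = 2 * n * c ** 2 * sigma2 ** 2   # Var(S^2) = 2 n sigma^4
    bias = (n * c - 1) * sigma2 + d           # E(S^2) = n sigma^2
    return variance + bias ** 2


n, sigma2 = 6, Fraction(3, 2)
threshold = Fraction(2, n + 2)
step = Fraction(1, 100)
below = risk(threshold - step, 0, n, sigma2)   # c slightly under 2/(n + 2)
above = risk(threshold + step, 0, n, sigma2)   # c slightly over 2/(n + 2)
zero = risk(0, 0, n, sigma2)                   # the constant estimator 0
print(below < zero, above > zero, risk(threshold, 0, n, sigma2) == zero)
```

For positive $c$, the comparison flips exactly at $c = 2/(n + 2)$, where $cS^2$ and the constant 0 have equal risk $\sigma^4$.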

2.14 For the situation of Example 2.15, let Z = X/S. 

(a) Show that the risk, under squared error loss, of $\delta = \varphi(z)s^2$ is minimized by taking

$$\varphi(z) = \varphi^*_{\mu,\sigma}(z) = E(S^2/\sigma^2 \mid z) / E((S^2/\sigma^2)^2 \mid z).$$

(b) Stein (1964) showed that $\varphi^*_{\mu,\sigma}(z) \le \varphi^*_{0,1}(z)$ for every $\mu, \sigma$. Assuming this is so,
deduce that $\varphi_S(Z)S^2$ dominates $[1/(n + 1)]S^2$ in squared error loss, where

$$\varphi_S(z) = \min\left\{ \varphi^*_{0,1}(z),\ \frac{1}{n + 1} \right\}.$$

(c) Show that $\varphi^*_{0,1}(z) = (1 + z^2)/(n + 2)$, and, hence, $\varphi_S(Z)S^2$ is given by (2.31).

(d) The best equivariant estimator of the form $\varphi(Z)S^2$ was derived by Brewster and
Zidek (1974) and is given by

$$\varphi_{BZ}(z) = \frac{E(S^2 \mid Z \le z)}{E(S^4 \mid Z \le z)},$$
where the expectation is calculated assuming $\mu = 0$ and $\sigma = 1$. Show that $\varphi_{BZ}(Z)S^2$
is generalized Bayes against the prior

$$\pi(\mu, \sigma) = \frac{1}{\sigma} \int_0^\infty u^{-1/2} (1 + u)^{-1} e^{-un\mu^2/\sigma^2}\, du\, d\mu\, d\sigma.$$

[Brewster and Zidek did not originally derive their estimator as a Bayes estimator,
but rather first found the estimator and then found the prior. Brown (1968)
considered a family of estimators similar to those of Stein (1964), which took different
values depending on a cutoff point for $z^2$. Brewster and Zidek (1974) showed that
the number of cutoff points can be arbitrarily large. They constructed a sequence of
estimators, with decreasing risks and increasingly dense cutoffs, whose limit was
the best equivariant estimator.]

2.15 Show the equivalence of the following relationships: (a) (2.26) and (2.27), (b) (2.34)
and (2.35) when $c = \sqrt{(n - 1)/(n + 1)}$, and (c) (2.38) and (2.39).

2.16 In Example 2.17, show that the estimator $aX/n + b$ is inadmissible for all $(a, b)$
outside the triangle (2.39).

[Hint: Problems 2.4-2.6.]

2.17 Prove admissibility of the estimators corresponding to the interior of the triangle 
(2.39), by applying Theorem 2.4 and using the results of Example 4.1.5. 

2.18 Use Theorem 2.14 to provide an alternative proof for the admissibility of the esti¬ 
mator aX + b satisfying (2.6), in Example 2.5. 

2.19 Determine which estimators aX + b are admissible for estimating E(X) in the 
following situations, for squared error loss: 

(a) X has a Poisson distribution. 

(b) X has a negative binomial distribution (Gupta 1966). 

2.20 Let $X$ have the Poisson($\lambda$) distribution, and consider the estimation of $\lambda$ under the
loss $(d - \lambda)^2/\lambda$ with the restriction $0 \le \lambda \le m$, where $m$ is known.

(a) Using an argument similar to that of Example 2.9, show that $X$ is not minimax,
and a least favorable prior distribution must have a set $\omega_\Lambda$ [of (1.5)] consisting of
a finite number of points.

(b) Let $\Lambda_a$ be a prior distribution that puts mass $a_i$, $i = 1, \ldots, k$, at parameter points
$b_i$, $i = 1, \ldots, k$. Show that the Bayes estimator associated with this prior is

$$\delta^{\Lambda_a}(x) = \frac{1}{E(\lambda^{-1} \mid x)} = \frac{\sum_{i=1}^{k} a_i b_i^{x} e^{-b_i}}{\sum_{i=1}^{k} a_i b_i^{x-1} e^{-b_i}}.$$

(c) Let $m_0$ be the solution to $m = e^{-m}$ ($m_0 \approx .57$). Show that for $0 \le \lambda \le m$, $m \le m_0$,
a one-point prior ($a_1 = 1$, $b_1 = m$) yields the minimax estimator. Calculate the
minimax risk and compare it to that of $X$.

(d) Let $m_1$ be the first positive zero of $(1 + \delta^\Lambda(m))^2 = 2 + m^2/2$, where $\Lambda$ is a two-point
prior ($a_1 = a$, $b_1 = 0$; $a_2 = 1 - a$, $b_2 = m$). Show that for $0 \le \lambda \le m$, $m_0 < m \le m_1$,
a two-point prior yields the minimax estimator (use Corollary 1.6). Calculate the
minimax risk and compare it to that of $X$.

[As $m$ increases, the situation becomes more complex and exact minimax solutions
become intractable. For these cases, linear approximations can be quite satisfactory.
See Johnstone and MacGibbon 1992, 1993.]
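
Two ingredients of Problem 2.20 lend themselves to a quick computation: the constant $m_0$ solving $m = e^{-m}$, and the Bayes estimator $1/E(\lambda^{-1} \mid x)$ for a finitely supported prior. The sketch below is illustrative only; it assumes all support points $b_i$ are strictly positive (a point at 0 needs separate handling), and the function names are hypothetical.

```python
import math


def solve_m0(tol: float = 1e-12) -> float:
    """Bisection for the root of m - exp(-m) on (0, 1)."""
    lo, hi = 0.0, 1.0   # the function is -1 at 0 and positive at 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid - math.exp(-mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def bayes_discrete_prior(x: int, masses, points) -> float:
    """1 / E(1/lambda | x) for X ~ Poisson(lambda) and prior mass a_i at b_i > 0."""
    num = sum(a * b ** x * math.exp(-b) for a, b in zip(masses, points))
    den = sum(a * b ** (x - 1) * math.exp(-b) for a, b in zip(masses, points))
    return num / den


print(solve_m0())
print(bayes_discrete_prior(3, [1.0], [0.4]))             # one-point prior at 0.4
print(bayes_discrete_prior(2, [0.5, 0.5], [0.2, 0.6]))   # two-point prior
```

With a one-point prior the estimator is the constant $m$ for every $x$, which is the estimator referred to in part (c).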

2.21 Show that the conditions (2.41) and (2.42) of Example 2.22 are not only sufficient
but also necessary for admissibility of (2.40).

2.22 Let $X$ and $Y$ be independently distributed according to Poisson distributions with
$E(X) = \xi$ and $E(Y) = \eta$, respectively. Show that $aX + bY + c$ is admissible for estimating
$\xi$ with squared error loss if and only if either $0 \le a < 1$, $b \ge 0$, $c \ge 0$ or $a = 1$, $b = c = 0$
(Makani 1972).

2.23 Let $X$ be distributed with density $\frac{1}{2}\beta(\theta)e^{\theta x}e^{-|x|}$, $|\theta| < 1$.

(a) Show that $\beta(\theta) = 1 - \theta^2$.

(b) Show that $aX + b$ is admissible for estimating $E_\theta(X)$ with squared error loss if and
only if $0 \le a \le 1/2$.

[Hint: (b) To see necessity, let $\delta = (1/2 + \varepsilon)X + b$ ($0 < \varepsilon \le 1/2$) and show that $\delta$ is
dominated by $\delta' = (1 - \frac{1}{2}\alpha + \alpha\varepsilon)X + (b/\alpha)$ for some $\alpha$ with $0 < \alpha < 1/(1/2 - \varepsilon)$.]

2.24 Let $X$ be distributed as $N(\theta, 1)$ and let $\theta$ have the improper prior density $\pi(\theta) = e^\theta$
($-\infty < \theta < \infty$). For squared error loss, the formal Bayes estimator of $\theta$ is $X + 1$, which
is neither minimax nor admissible. (See also Problem 2.15.)

Conditions under which the formal Bayes estimator corresponding to an improper prior
distribution for $\theta$ in Example 3.4 is admissible are given by Zidek (1970).
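
The formal Bayes estimator in Problem 2.24 can be confirmed by numerical integration. The sketch below assumes $X \sim N(\theta, 1)$ with improper prior density $e^\theta$ and approximates the posterior mean by a Riemann sum on a wide grid around $x$; the grid width and step count are arbitrary choices, and `formal_bayes` is a hypothetical name.

```python
import math


def formal_bayes(x: float, half_width: float = 12.0, steps: int = 24000) -> float:
    """Posterior mean of theta for X ~ N(theta, 1) and improper prior density exp(theta)."""
    h = 2 * half_width / steps
    num = 0.0
    den = 0.0
    for i in range(steps + 1):
        theta = x - half_width + i * h
        # Unnormalized posterior: prior exp(theta) times likelihood exp(-(x - theta)^2 / 2).
        w = math.exp(theta - 0.5 * (x - theta) ** 2)
        num += theta * w
        den += w
    return num / den


for x in (-2.0, 0.3, 1.5):
    # The second and third columns should agree closely.
    print(x, formal_bayes(x), x + 1)
```

Completing the square shows that the formal posterior is $N(x + 1, 1)$. Since $X + 1$ has variance 1 and bias 1, its risk is 2 for every $\theta$, against risk 1 for $X$.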

2.25 Show that the natural parameter space of the family (2.16) is (—oo, oo) for the 
normal (variance known), binomial, and Poisson distribution but not in the gamma or 
negative binomial case. 


Section 3 

3.1 Show that Theorem 3.2.7 remains valid for almost equivariant estimators. 

3.2 Verify the density (3.1). 

3.3 In Example 3.3, show that a loss function remains invariant under G if and only if it 
is a function of (d — 9)*. 

3.4 In Example 3.3, show that neither of the loss functions $[(d - \theta)^{**}]^2$ or $|(d - \theta)^{**}|$ is
convex.

3.5 Let $Y$ be distributed as $G(y - \eta)$. If $T = [Y]$ and $X = Y - T$, find the distribution of
$X$ and show that it depends on $\eta$ only through $\eta - [\eta]$.

3.6 (a) If $X_1, \ldots, X_n$ are iid with density $f(x - \theta)$, show that the MRE estimator against
squared error loss [the Pitman estimator of (3.1.28)] is the Bayes estimator against
right-invariant Haar measure.

(b) If $X_1, \ldots, X_n$ are iid with density $(1/\tau) f[(x - \mu)/\tau]$, show that:

(i) Under squared error loss, the Pitman estimator of (3.1.28) is the Bayes
estimator against right-invariant Haar measure.

(ii) Under the loss (3.3.17), the Pitman estimator of (3.3.19) is the Bayes estimator
against right-invariant Haar measure.

3.7 Prove formula (3.9). 

3.8 Prove (3.11). 

[Hint: In the term on the left side, lim inf can be replaced by lim. Let the left side of
(3.11) be $A$ and the right side $B$, and let $A_N = \inf h(a, b)$, where the inf is taken over
$a \le -N$, $b \ge N$, $N = 1, 2, \ldots$, so that $A_N \to A$. There exist $(a_N, b_N)$ such that
$|h(a_N, b_N) - A_N| \le 1/N$. Then, $h(a_N, b_N) \to A$ and $A \ge B$.]

3.9 In Example 3.8, let $h(\theta)$ be the length of the path $\theta$ after cancellation. Show that $h$
does not satisfy conditions (3.2.11).

3.10 Discuss Example 3.8 for the case that the random walk instead of being in the plane 
is (a) on the line and (b) in three-space. 

3.11 (a) Show that the probabilities (3.17) add up to 1.

(b) With $p_k$ given by (3.17), show that the risk (3.16) is infinite.

[Hint: (a) $1/k(k + 1) = (1/k) - 1/(k + 1)$.]

3.12 Show that the risk $R(\theta, \delta)$ of (3.18) is finite.

[Hint: $R(\theta, \delta) \le \sum_{k > M|k - |\theta||} 1/(k + 1) \le \sum_{c < k < d} 1/(k + 1) < \int_c^{d+1} dx/x$, where
$c = M|\theta|/(M + 1)$ and $d = M|\theta|/(M - 1)$. The reason for the second inequality is that
values of $k$ outside $(c, d)$ make no contribution to the sum.]

3.13 Show that the two estimators $\delta^*$ and $\delta^{**}$, defined by (3.20) and (3.21), respectively,
are equivariant.

3.14 Prove the relations (3.22) and (3.23). 

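3.15 Let the distribution of $X$ depend on parameters $\theta$ and $\vartheta$, let the risk function of an
estimator $\delta = \delta(x)$ of $\theta$ be $R(\theta, \vartheta; \delta)$, and let $r(\theta, \delta) = \int R(\theta, \vartheta; \delta)\,dP(\vartheta)$ for some
distribution $P$. If $\delta_0$ minimizes $\sup_\theta r(\theta, \delta)$ and satisfies $\sup_\theta r(\theta, \delta_0) = \sup_{\theta,\vartheta} R(\theta, \vartheta; \delta_0)$,
show that $\delta_0$ minimizes $\sup_{\theta,\vartheta} R(\theta, \vartheta; \delta)$.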


Section 4 

4.1 In Example 4.2, show that an estimator $\delta$ is equivariant if and only if it satisfies (4.11)
and (4.12).

4.2 Show that a function p. satisfies (4.12) if and only if it depends only on EX 2 . 

4.3 Verify the Bayes estimator (4.15). 

4.4 Let $X_i$ be independent with binomial distribution $b(p_i, n_i)$, $i = 1, \ldots, r$. For
estimating $p = (p_1, \ldots, p_r)$ with average squared error loss (4.17), find the minimax estimator
of $p$, and determine whether it is admissible.

4.5 Establishing the admissibility of the normal mean in two dimensions is quite difficult,
made so by the fact that the conjugate priors fail in the limiting Bayes method. Let

$$X \sim N_2(\theta, I) \quad \text{and} \quad L(\theta, \delta) = |\theta - \delta|^2.$$

The conjugate priors are $\theta \sim N_2(0, \tau^2 I)$, $\tau^2 > 0$.

(a) For this sequence of priors, verify that the limiting Bayes argument, as in Example
2.8, results in inequality (4.18), which does not establish admissibility.

(b) Stein (in James and Stein 1961) proposed the sequence of priors that works to
prove $X$ is admissible by the limiting Bayes method. A version of these priors,
given by Brown and Hwang (1982), is

$$g_n(\theta) = \begin{cases} 1 & \text{if } |\theta| \le 1 \\ 1 - \dfrac{\log |\theta|}{\log n} & \text{if } 1 < |\theta| \le n \\ 0 & \text{if } |\theta| > n \end{cases}$$

for $n = 2, 3, \ldots$. Show that $\delta^{g_n}(x) \to x$ a.e. as $n \to \infty$.

(c) A special case of the very general results of Brown and Hwang (1982) states that for
the prior 7t„(0 ) 2 g(0), the limiting Bayes method (Blyth’s method) will establish the 
admissibility of the estimator $\delta^g(x)$ [the generalized Bayes estimator against $g(\theta)$]
if

$$\int_{\{\theta : |\theta| > 1\}} \frac{g(\theta)}{|\theta|^2 [\max\{\log |\theta|, \log 2\}]^2}\, d\theta < \infty.$$

Show that this holds for $g(\theta) = 1$ and that $\delta^g(x) = x$, so $x$ is admissible.

[Stein (1956b) originally established the admissibility of X in two dimensions using an 
argument based on the information inequality. His proof was complicated by the fact that 
he needed some additional invariance arguments to establish the result. See Theorem 
7.19 and Problem 7.19 for more general statements of the Brown/Hwang result.] 

4.6 Let $X_1, X_2, \ldots, X_r$ be independent with $X_i \sim N(\theta_i, 1)$. The following heuristic
argument, due to Stein (1956b), suggests that it should be possible, at least for large $r$
and hence large $|\theta|$, to improve on the estimator $X = (X_1, X_2, \ldots, X_r)$.

(a) Use a Taylor series argument to show

$$|x|^2 = r + |\theta|^2 + O_p[(r + |\theta|^2)^{1/2}],$$

so, with high probability, the true $\theta$ is in the sphere $\{\theta : |\theta|^2 \le |x|^2 - r\}$. The
usual estimator $X$ is approximately the same size as $\theta$ and will almost certainly be
outside of this sphere.

(b) Part (a) suggested to Stein an estimator of the form $\delta(x) = [1 - h(|x|^2)]x$. Show
that

$$|\theta - \delta(x)|^2 = (1 - h)^2 |x - \theta|^2 - 2h(1 - h)\theta'(x - \theta) + h^2 |\theta|^2.$$

(c) Establish that $\theta'(x - \theta)/|\theta| = Z \sim N(0, 1)$, and $|x - \theta|^2 \approx r$, and, hence,

$$|\theta - \delta(x)|^2 \approx (1 - h)^2 r + h^2 |\theta|^2 + O_p[(r + |\theta|^2)^{1/2}].$$

(d) Show that the leading term in part (c) is minimized at $h = r/(r + |\theta|^2)$, and since
$|x|^2 \approx r + |\theta|^2$, this leads to the estimator $\delta(x) = \left(1 - \frac{r}{|x|^2}\right)x$ of (4.20).
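
Parts (b) and (d) of Problem 4.6 are algebraic, so they can be spot-checked on arbitrary numbers. The sketch below treats $h$ as a fixed constant, verifies the three-term expansion of $|\theta - (1 - h)x|^2$, and checks that $(1 - h)^2 r + h^2|\theta|^2$ is smallest at $h = r/(r + |\theta|^2)$. The vectors and function names are made up for illustration.

```python
def squared_error(theta, x, h):
    """|theta - delta(x)|^2 for delta(x) = (1 - h) x."""
    return sum((t - (1 - h) * xi) ** 2 for t, xi in zip(theta, x))


def expansion(theta, x, h):
    """The three-term expansion (1-h)^2 |x-theta|^2 - 2h(1-h) theta'(x-theta) + h^2 |theta|^2."""
    dist2 = sum((xi - t) ** 2 for t, xi in zip(theta, x))
    cross = sum(t * (xi - t) for t, xi in zip(theta, x))
    norm2 = sum(t * t for t in theta)
    return (1 - h) ** 2 * dist2 - 2 * h * (1 - h) * cross + h * h * norm2


def leading_term(h, r, norm2):
    """Leading term of part (c): (1 - h)^2 r + h^2 |theta|^2."""
    return (1 - h) ** 2 * r + h * h * norm2


theta = [1.0, -2.0, 0.5, 3.0]
x = [0.2, -1.1, 1.7, 2.4]
h = 0.3
print(squared_error(theta, x, h), expansion(theta, x, h))

r = len(theta)
norm2 = sum(t * t for t in theta)
h_best = r / (r + norm2)
print(h_best, leading_term(h_best, r, norm2))
```

Setting the derivative of the leading term to zero gives $-(1 - h)r + h|\theta|^2 = 0$, which is the stated minimizer.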

4.7 If $S^2$ is distributed as $\chi^2_r$, use (2.2.5) to show that $E(S^{-2}) = 1/(r - 2)$.
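
The value in Problem 4.7 can be reproduced from the general moment formula for a chi-squared variable, $E[(S^2)^k] = 2^k\,\Gamma(r/2 + k)/\Gamma(r/2)$, taken at $k = -1$ (valid for $r > 2$). Whether this is the exact form of (2.2.5) is an assumption here; the sketch only confirms the arithmetic.

```python
import math


def inv_chi2_mean(r: int) -> float:
    """E(1/S^2) for S^2 ~ chi^2_r with r > 2, from the gamma-function moment formula at k = -1."""
    return math.gamma(r / 2 - 1) / (2 * math.gamma(r / 2))


for r in (3, 4, 5, 10, 20):
    # The two printed values should agree up to rounding error.
    print(r, inv_chi2_mean(r), 1 / (r - 2))
```

The simplification rests on $\Gamma(r/2) = (r/2 - 1)\,\Gamma(r/2 - 1)$.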

4.8 In Example 4.7, show that 1Z is nonsingular for p l and p 2 and singular for p 3 and p 4 . 

4.9 Show that the function p 2 of Example 4.7 is convex. 

4.10 In Example 4.7, show that X is admissible for (a) p 2 and (b) p 4 . 

[Hint: (a) It is enough to show that $X_1$ is admissible for estimating $\theta_1$ with loss $(d_1 - \theta_1)^2$.
This can be shown by letting $\theta_2, \ldots, \theta_r$ be known. (b) Note that $X$ is admissible minimax
for $\theta = (\theta_1, \ldots, \theta_r)$ when $\theta_1 = \cdots = \theta_r$.]

4.11 In Example 4.8, show that X is admissible under the assumptions (ii)(a). 

[Hint: 

i. If v(t) > 0 is such that 

f — — e~'~ l2x ~ dt < oo, 

J m 

show that there exists a constant k(z) for which 

K(0) = k(T) [Su(0;)] exp ^-pLs0 2 J /n v(9j) 
is a probability density for 0 = (0|,..., 6 r ). 

ii. If the $X_i$ are independent $N(\theta_i, 1)$ and $\theta$ has the prior $\lambda_\tau(\theta)$, the Bayes estimator
of $\theta$ with loss function (4.27) is $\tau^2 X/(1 + \tau^2)$.

iii. To prove $X$ admissible, use (4.18) with $\lambda_\tau(\theta)$ instead of a normal prior.]

4.12 Let $\mathcal{L}$ be a family of loss functions and suppose there exists $L_0 \in \mathcal{L}$ and a minimax
estimator $\delta_0$ with respect to $L_0$ such that in the notation of (4.29),

$$\sup_{L, \theta} R_L(\theta, \delta_0) = \sup_{\theta} R_{L_0}(\theta, \delta_0).$$

Then, $\delta_0$ is minimax with respect to $\mathcal{L}$; that is, it minimizes $\sup_{L, \theta} R_L(\theta, \delta)$.

4.13 Assuming (4.25), show that $E = 1 - [(r - 2)^2 / r|X - \mu|^2]$ is the unique unbiased
estimator of the risk (5.4.25), and that $E$ is inadmissible. [The estimator $E$ is also
unbiased for estimation of the loss $L(\theta, \delta)$. See Note 9.5.]

4.14 A natural extension of risk domination under a particular loss is to risk domination
under a class of losses. Hwang (1985) defines universal domination of $\delta$ by $\delta'$ if the
inequality

$$E_\theta L(|\theta - \delta'(X)|) \le E_\theta L(|\theta - \delta(X)|) \quad \text{for all } \theta$$

holds for all loss functions $L(\cdot)$ that are nondecreasing, with at least one loss function
producing nonidentical risks.

(a) Show that $\delta'$ universally dominates $\delta$ if and only if it stochastically dominates $\delta$,
that is, if and only if

$$P_\theta(|\theta - \delta'(X)| > k) \le P_\theta(|\theta - \delta(X)| > k)$$

for all $k$ and $\theta$ with strict inequality for some $\theta$.

[Hint: For a positive random variable $Y$, recall that $EY = \int_0^\infty P(Y > t)\,dt$.
Alternatively, use the fact that stochastic ordering on random variables induces an
ordering on expectations. See Lemma 1, Section 3.3 of TSH2.]

(b) For $X \sim N_r(\theta, I)$, show that the James-Stein estimator $\delta_c(x) = (1 - c/|x|^2)x$ does
not universally dominate $x$. [From (a), it only need be shown that
$P_\theta(|\theta - \delta_c(X)| > k) > P_\theta(|\theta - X| > k)$ for some $\theta$ and $k$. Take $\theta = 0$ and find such a $k$.]
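
The suggestion in part (b) of Problem 4.14 can be illustrated with exact tail probabilities. The sketch below takes $\theta = 0$ and, purely for convenience, $r = 4$ and $c = r - 2 = 2$, so that $|X|^2 \sim \chi^2_4$ with survival function $e^{-t/2}(1 + t/2)$. At $\theta = 0$ the error of $\delta_c$ is $\big||X| - c/|X|\big|$, which exceeds $k$ when $|X|$ is either large or very small. The function names and the choice $k = 5$ are illustrative.

```python
import math


def chi2_4_sf(t: float) -> float:
    """P(chi^2_4 > t), which has the closed form exp(-t/2) (1 + t/2)."""
    return math.exp(-t / 2) * (1 + t / 2)


def tail_usual(k: float) -> float:
    """P(|X| > k) at theta = 0 for r = 4."""
    return chi2_4_sf(k * k)


def tail_james_stein(k: float, c: float) -> float:
    """P(| |X| - c/|X| | > k) at theta = 0 for r = 4."""
    root = math.sqrt(k * k + 4 * c)
    upper = (k + root) / 2    # |X| - c/|X| > k  iff  |X| > upper
    lower = (root - k) / 2    # c/|X| - |X| > k  iff  |X| < lower
    return chi2_4_sf(upper ** 2) + (1 - chi2_4_sf(lower ** 2))


k, c = 5.0, 2.0
print(tail_usual(k), tail_james_stein(k, c))
```

For large $k$ the small-$|X|$ event dominates: its probability decays only polynomially in $1/k$, while $P(|X| > k)$ decays exponentially, so the James-Stein tail is eventually the larger one.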

Hwang (1985) and Brown and Hwang (1989) explore many facets of universal
domination. Hwang (1985) shows that even $\delta^+$ does not universally dominate $X$ unless the
class of loss functions is restricted.

We also note that although the inequality in part (a) may seem reminiscent of the “Pitman 
closeness” criterion, there is really no relation. The criterion of Pitman closeness suffers 
from a number of defects not shared by stochastic domination (see Robert et al. 1993). 


Section 5 


5.1 Show that the estimator $\delta_c$ defined by (5.2) with $0 < c = 1 - \Delta < 1$ is dominated by
any $\delta_d$ with $|d - 1| < \Delta$.

5.2 In the context of Theorem 5.1, show that

$$E_\theta\left[\frac{1}{|X|^2}\right] \le E_0\left[\frac{1}{|X|^2}\right] < \infty.$$

[Hint: The chi-squared distribution has monotone likelihood ratio in the noncentrality
parameter.]

5.3 Stigler (1990) presents an interesting explanation of the Stein phenomenon using a
regression perspective, and also gives an identity that can be used to prove the minimaxity
of the James-Stein estimator. For $X \sim N_r(\theta, I)$, and $\delta_c(x) = \left(1 - \frac{c}{|x|^2}\right)x$:

(a) Show that

$$E_\theta|\theta - \delta_c(X)|^2 = r + 2cE_\theta\left[\frac{X'\theta + (c/2)}{|X|^2} - 1\right].$$

(b) The expression in square brackets is increasing in $c$. Prove the minimaxity of $\delta_c$ for
$0 \le c \le 2(r - 2)$ by establishing Stigler’s identity

$$E_\theta\left[\frac{X'\theta + r - 2}{|X|^2}\right] = 1.$$

[Hint: Part (b) can be established by transforming to polar coordinates and directly
integrating, or by writing $\frac{X'\theta}{|X|^2} = \frac{X'(\theta - X)}{|X|^2} + 1$ and using Stein’s identity.]
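
Stigler's identity in Problem 5.3, taken here in the form $E_\theta[(X'\theta + r - 2)/|X|^2] = 1$ (which follows from Stein's identity for $r \ge 3$), can be checked by simulation. The sketch below is a Monte Carlo illustration only; the vector `theta`, the number of draws, and the seed are arbitrary choices, and a dimension of 8 is used so that the simulated average is stable.

```python
import random


def stigler_mean(theta, draws: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[(X'theta + r - 2) / |X|^2] for X ~ N_r(theta, I)."""
    rng = random.Random(seed)
    r = len(theta)
    total = 0.0
    for _ in range(draws):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        norm2 = sum(v * v for v in x)
        inner = sum(v * t for v, t in zip(x, theta))
        total += (inner + r - 2) / norm2
    return total / draws


theta = [1.0, -0.5, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
estimate = stigler_mean(theta)
print(estimate)   # should be close to 1, up to simulation error
```

With the identity in hand, the bracket in part (a) has expectation zero at $c = 2(r - 2)$; since it is increasing in $c$, the risk is at most $r$ for $0 \le c \le 2(r - 2)$.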

5.4 (a) Prove Theorem 5.5. 

(b) Apply Theorem 5.5 to establish conditions for minimaxity of Strawderman’s (1971) 
proper Bayes estimator given by (5.10) and (5.12). 

[Hint: (a) Use the representation of the risk given in (5.4), with $g(x) = c(|x|)(r - 2)x/|x|^2$.
Show that $R(\theta, \delta)$ can be written

$$R(\theta, \delta) = 1 - \frac{(r - 2)^2}{r} E_\theta\left[\frac{c(|X|)(2 - c(|X|))}{|X|^2}\right] - \frac{2(r - 2)}{r} E_\theta\left[\frac{\sum_i X_i \frac{\partial}{\partial X_i} c(|X|)}{|X|^2}\right]$$

and an upper bound on $R(\theta, \delta)$ is obtained by dropping the last term. It is not necessary
to assume that $c(\cdot)$ is differentiable everywhere; it can be nondifferentiable on a set of
Lebesgue measure zero.]

5.5 For the hierarchical model (5.11) of Strawderman (1971):

(a) Show that the Bayes estimator against squared error loss is given by
$E(\theta|x) = [1 - E(\lambda|x)]x$ where

$$E(\lambda|x) = \frac{\int_0^1 \lambda^{r/2 - a + 1} e^{-\lambda|x|^2/2}\, d\lambda}{\int_0^1 \lambda^{r/2 - a} e^{-\lambda|x|^2/2}\, d\lambda}.$$

(b) Show that $E(\lambda|x)$ has the alternate representations

$$E(\lambda|x) = \frac{r - 2a + 2}{|x|^2} \cdot \frac{P(\chi^2_{r - 2a + 4} \le |x|^2)}{P(\chi^2_{r - 2a + 2} \le |x|^2)},$$

$$E(\lambda|x) = \frac{r - 2a + 2}{|x|^2} - \frac{2e^{-|x|^2/2}}{|x|^2 \int_0^1 \lambda^{r/2 - a} e^{-\lambda|x|^2/2}\, d\lambda},$$

and hence that $a = 0$ gives the estimator of (5.12).

(c) Show that $|x|^2 E(\lambda|x)$ is increasing in $|x|^2$ with maximum $r - 2a + 2$. Hence, the
Bayes estimator is minimax if $r - 2a + 2 \le 2(r - 2)$ or $r \ge 2(3 - a)$. For $0 \le a < 1$,
this requires $r \ge 5$.

[Berger (1976b) considers matrix generalizations of this hierarchical model and derives
admissible minimax estimators. Proper Bayes minimax estimators only exist if $r \ge 5$
(Strawderman 1971); however, formal Bayes minimax estimators exist for $r = 3$ and $4$.]

5.6 Consider a generalization of the Strawderman (1971) hierarchical model of Problem
5.5:

$X|\theta \sim N(\theta, I)$,

$\theta|\lambda \sim N(0, \lambda^{-1}(1 - \lambda)I)$,

$\lambda \sim \pi(\lambda)$.

(a) Show that the Bayes estimator against squared error loss is $[1 - E(\lambda|x)]x$, where

$$E(\lambda|x) = \frac{\int_0^1 \lambda^{r/2 + 1} e^{-\lambda|x|^2/2}\, \pi(\lambda)\, d\lambda}{\int_0^1 \lambda^{r/2} e^{-\lambda|x|^2/2}\, \pi(\lambda)\, d\lambda}.$$

(b) Suppose $\lambda \sim \mathrm{beta}(\alpha, \beta)$, with density

$$\pi(\lambda) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \lambda^{\alpha - 1}(1 - \lambda)^{\beta - 1}.$$

Show that the Bayes estimator is minimax if $\beta \ge 1$ and $0 \le \alpha \le (r - 4)/2$.

[Hint: Use integration by parts on $E(\lambda|x)$, and apply Theorem 5.5. These estimators
were introduced by Faith (1978).]

(c) Let $t = \lambda^{-1}(1 - \lambda)$, the prior precision of $\theta$. If $\lambda \sim \mathrm{beta}(\alpha, \beta)$, show that the density
of $t$ is proportional to $t^{\alpha - 1}/(1 + t)^{\alpha + \beta}$, that is, $t \sim F_{2\alpha, 2\beta}$, the $F$-distribution with
$2\alpha$ and $2\beta$ degrees of freedom.

[Strawderman’s prior of Problem 5.5 corresponds to $\beta = 1$ and $0 < \alpha < 1$. If we
take $\alpha = 1/2$ and $\beta = 1$, then $t \sim F_{1,2}$.]

(d) Two interesting limiting cases are $\alpha = 1$, $\beta = 0$ and $\alpha = 0$, $\beta = 1$. For each case,
show that the resulting prior on $t$ is proper, and comment on the minimaxity of the
resulting estimators.

5.7 Faith (1978) considered the hierarchical model

$X|\theta \sim N(\theta, I)$,

$t \sim \mathrm{Gamma}(a, b)$,

that is,

$$\pi(t) = \frac{1}{\Gamma(a) b^a}\, t^{a - 1} e^{-t/b}.$$

(a) Show that the marginal prior for $\theta$, unconditional on $t$, is

$$\pi(\theta) \propto (2/b + |\theta|^2)^{-(a + r/2)},$$

a multivariate Student’s $t$-distribution.

(b) Show that a < —1 is a sufficient condition for JT 3 1 > 0 and, hence, is 

a sufficient condition for the minimaxity of the Bayes estimator against squared 
error loss. 

(c) Show, more generally, that the Bayes estimator against squared error loss is minimax 
if a < (r — 4)/2 and a < 1/6 + 3. 

(d) What choices of a and 6 would produce a multivariate Cauchy prior for jr(0)? Is 
the resulting Bayes estimator minimax? 
5.8 (a) Let $X \sim N(\theta, \Sigma)$ and consider the estimation of $\theta$ under the loss $L(\theta, \delta) = (\theta - \delta)'(\theta - \delta)$. Show that $R(\theta, X) = \mathrm{tr}\,\Sigma$, the minimax risk. Hence, X is a
minimax estimator.

(b) Let $X \sim N(\theta, I)$ and consider estimation of $\theta$ under the loss $L(\theta, \delta) = (\theta - \delta)'Q(\theta - \delta)$, where Q is a known positive definite matrix. Show that $R(\theta, X) = \mathrm{tr}\,Q$,
the minimax risk. Hence, X is a minimax estimator.

(c) Show that the calculations in parts (a) and (b) are equivalent.

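5.9 In Theorem 5.7, verify

$$E_\theta\left[\frac{c(|X|^2)}{|X|^2}\,X'(\theta - X)\right] = E_\theta\left\{\frac{c(|X|^2)}{|X|^2}\,\mathrm{tr}(\Sigma) - 2\,\frac{c(|X|^2)}{|X|^4}\,X'\Sigma X + 2\,\frac{c'(|X|^2)}{|X|^2}\,X'\Sigma X\right\}.$$

[Hint: There are several ways to do this:

(a) Write

$$E_\theta\,\frac{c(|X|^2)}{|X|^2}\,X'(\theta - X) = E_\eta\,\frac{c(Y'\Sigma Y)}{Y'\Sigma Y}\,Y'\Sigma(\eta - Y) = \sum_i E_\eta\left\{\frac{c(Y'\Sigma Y)}{Y'\Sigma Y}\sum_j Y_j\sigma_{ji}(\eta_i - Y_i)\right\},$$

where $\Sigma = \{\sigma_{ij}\}$ and $Y = \Sigma^{-1/2}X \sim N(\Sigma^{-1/2}\theta, I) = N(\eta, I)$. Now apply Stein's
lemma.

(b) Write $\Sigma = PDP'$, where P is an orthogonal matrix ($P'P = I$) and D is the diagonal
matrix of eigenvalues of $\Sigma$, $D = \mathrm{diagonal}\{d_i\}$. Then, establish that

$$E_\theta\,\frac{c(|X|^2)}{|X|^2}\,X'(\theta - X) = \sum_j E_{\eta^*}\,\frac{c(\sum_i d_iZ_i^2)}{\sum_i d_iZ_i^2}\,d_jZ_j(\eta_j^* - Z_j),$$

where $Z = P'\Sigma^{-1/2}X$ and $\eta^* = P'\Sigma^{-1/2}\theta$. Now apply Stein's lemma.]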

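5.10 In Theorem 5.7, show that condition (i) allows the most shrinkage when $\Sigma = \sigma^2I$,
for some value of $\sigma^2$. That is, show that for all $r \times r$ positive definite $\Sigma$,

$$\max_\Sigma \frac{\mathrm{tr}\,\Sigma}{\lambda_{\max}(\Sigma)} = \frac{\mathrm{tr}\,\sigma^2I}{\lambda_{\max}(\sigma^2I)} = r.$$

[Hint: Write $\frac{\mathrm{tr}\,\Sigma}{\lambda_{\max}(\Sigma)} = \sum_i \lambda_i/\lambda_{\max}$, where the $\lambda_i$'s are the eigenvalues of $\Sigma$.]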

5.11 The estimation problem of (5.18),

$$X \sim N(\theta, \Sigma),$$

$$L(\theta, \delta) = (\theta - \delta)'Q(\theta - \delta),$$

where both $\Sigma$ and Q are positive definite matrices, can always be reduced, without loss
of generality, to the simpler case,

$$Y \sim N(\eta, I),$$

$$L(\eta, \delta^*) = (\eta - \delta^*)'D_{q^*}(\eta - \delta^*),$$

where $D_{q^*}$ is a diagonal matrix with elements $(q_1^*, \ldots, q_r^*)$, using the following argument.
Define $R = \Sigma^{1/2}B$, where $\Sigma^{1/2}$ is a symmetric square root of $\Sigma$ (that is, $\Sigma^{1/2}\Sigma^{1/2} = \Sigma$),
and B is the matrix of eigenvectors of $\Sigma^{1/2}Q\Sigma^{1/2}$ (that is, $B'\Sigma^{1/2}Q\Sigma^{1/2}B = D_{q^*}$).

(a) Show that R satisfies

$$R'\Sigma^{-1}R = I, \qquad R'QR = D_{q^*}.$$

(b) Define $Y = R^{-1}X$. Show that $Y \sim N(\eta, I)$, where $\eta = R^{-1}\theta$.

(c) Show that the estimation problems are equivalent if we define $\delta^*(Y) = R^{-1}\delta(RY)$.

[Note: If $\Sigma$ has the eigenvalue-eigenvector decomposition $P'\Sigma P = D = \mathrm{diagonal}(d_1, \ldots, d_r)$, then we can define $\Sigma^{1/2} = PD^{1/2}P'$, where $D^{1/2}$ is a diagonal
matrix with elements $\sqrt{d_i}$. Since $\Sigma$ is positive definite, the $d_i$'s are positive.]

5.12 Complete the proof of Theorem 5.9.

(a) Show that the risk of $\delta(x)$ is

$$R(\theta, \delta) = E_\theta[(\theta - X)'Q(\theta - X)] - 2E_\theta\left[\frac{c(|X|^2)}{|X|^2}\,X'Q(\theta - X)\right] + E_\theta\left[\frac{c^2(|X|^2)}{|X|^4}\,X'QX\right],$$

where $E_\theta(\theta - X)'Q(\theta - X) = \mathrm{tr}(Q)$.

(b) Use Stein's lemma to verify

$$E_\theta\,\frac{c(|X|^2)}{|X|^2}\,X'Q(\theta - X) = E_\theta\left\{\frac{c(|X|^2)}{|X|^2}\,\mathrm{tr}(Q) - 2\,\frac{c(|X|^2)}{|X|^4}\,X'QX + 2\,\frac{c'(|X|^2)}{|X|^2}\,X'QX\right\}.$$

Use an argument similar to the one in Theorem 5.7.

[Hint: Write

$$E_\theta\,\frac{c(|X|^2)}{|X|^2}\,X'Q(\theta - X) = \sum_i E_\theta\left\{\frac{c(|X|^2)}{|X|^2}\sum_j X_jq_{ji}(\theta_i - X_i)\right\}$$

and apply Stein's lemma.]

5.13 Prove the following “generalization” of Theorem 5.9.

Theorem 8.2 Let $X \sim N(\theta, \Sigma)$. An estimator of the form (5.13) is minimax against the
loss $L(\theta, \delta) = (\theta - \delta)'Q(\theta - \delta)$, provided

(i) $0 \le c(|x|^2) \le 2[\mathrm{tr}(Q^*)/\lambda_{\max}(Q^*)] - 4$,

(ii) the function $c(\cdot)$ is nondecreasing,

where $Q^* = \Sigma^{1/2}Q\Sigma^{1/2}$.


5.14 Brown (1975) considered the performance of an estimator against a class of loss
functions

$$\mathcal{L}(C) = \left\{L : L(\theta, \delta) = \sum_{i=1}^r c_i(\theta_i - \delta_i)^2;\ (c_1, \ldots, c_r) \in C\right\}$$

for a specified set C, and proved the following theorem.

Theorem 8.3 For $X \sim N_r(\theta, I)$, there exists a spherically symmetric estimator $\delta$, that
is, $\delta(x) = [1 - h(|x|^2)]x$, where $h(|x|^2) \ne 0$, such that $R(\theta, \delta) \le R(\theta, X)$ for all $L \in \mathcal{L}(C)$
if, for all $(c_1, \ldots, c_r) \in C$, the inequality $\sum_{i=1}^r c_i > 2c_k$ holds for $k = 1, \ldots, r$.

Show that this theorem is equivalent to Theorem 5.9 in that the above inequality is
equivalent to part (i) of Theorem 5.9, and the estimator (5.13) is minimax.

[Hint: Identify the eigenvalues of Q with $c_1, \ldots, c_r$.]

Bock (1975) also establishes this theorem; see also Shinozaki (1980).

5.15 There are various ways to seemingly generalize Theorems 5.5 and 5.9. However, if 
both the estimator and loss function are allowed to depend on the covariance and loss 
matrix, then linear transformations can usually reduce the problem. 

Let $X \sim N_r(\theta, \Sigma)$, and let the loss function be $L(\theta, \delta) = (\theta - \delta)'Q(\theta - \delta)$, and consider
the following “generalizations” of Theorems 5.5 and 5.9.

(a) $\delta(x) = \left(1 - \dfrac{c(x'\Sigma^{-1}x)}{x'\Sigma^{-1}x}\right)x$, $\quad Q = \Sigma^{-1}$,

(b) $\delta(x) = \left(1 - \dfrac{c(x'Qx)}{x'Qx}\right)x$, $\quad \Sigma = I$ or $\Sigma = Q$,

(c) $\delta(x) = \left(1 - \dfrac{c(x'\Sigma^{-1/2}Q\Sigma^{-1/2}x)}{x'\Sigma^{-1/2}Q\Sigma^{-1/2}x}\right)x$.

In each case, use transformations to reduce the problem to that of Theorem 5.5 or 5.9,
and deduce the condition for minimaxity of $\delta$.

[Hint: For example, in (a) the transformation $Y = \Sigma^{-1/2}X$ will show that $\delta$ is minimax
if $0 < c(\cdot) < 2(r - 2)$.]

5.16 A natural extension of the estimator (5.10) is to one that shrinks toward an arbitrary
known point $\mu = (\mu_1, \ldots, \mu_r)$,

$$\delta_\mu(x) = \mu + \left[1 - c(S)\,\frac{r - 2}{|x - \mu|^2}\right](x - \mu),$$

where $|x - \mu|^2 = \sum(x_i - \mu_i)^2$.

(a) Show that, under the conditions of Theorem 5.5, $\delta_\mu$ is minimax.

(b) Show that its positive-part version is a better estimator.
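
A small Monte Carlo experiment can illustrate both parts of Problem 5.16 before proving them. The sketch below is only an illustration under stated assumptions: it takes the function $c(\cdot)$ to be a constant `c`, places the true mean at a hypothetical target `mu`, and uses squared error loss; the helper names are not from the text.

```python
import random


def shrink_toward(x, mu, c=1.0, positive_part=False):
    """mu + [1 - c (r - 2) / |x - mu|^2] (x - mu), optionally truncated at zero."""
    r = len(x)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    dist_sq = sum(d * d for d in diff)
    factor = 1.0 - c * (r - 2) / dist_sq
    if positive_part:
        factor = max(0.0, factor)
    return [mi + factor * d for mi, d in zip(mu, diff)]


def mc_risk(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of E|theta - estimator(X)|^2 for X ~ N_r(theta, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]
        est = estimator(x)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / n_rep


if __name__ == "__main__":
    mu = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical shrinkage target
    theta = list(mu)                # true mean placed at the target
    print("X itself     :", mc_risk(lambda x: x, theta))
    print("shrink to mu :", mc_risk(lambda x: shrink_toward(x, mu), theta))
    print("positive part:", mc_risk(lambda x: shrink_toward(x, mu, positive_part=True), theta))
```

Because the same seed is reused, the three estimators are compared on identical samples; moving `theta` away from `mu` shows how the gain shrinks as the target becomes a worse guess.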


5.17 Let $X \sim N_r(\theta, I)$. Show that the Bayes estimator of $\theta$, against squared error loss,
is given by $\delta(x) = x + \nabla\log m(x)$, where $m(x)$ is the marginal density function and
$\nabla f = \{\partial f/\partial x_i\}$.
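
The representation in Problem 5.17 can be checked numerically in a case where the answer is known in closed form. The sketch below assumes, purely for illustration, the conjugate prior $\theta \sim N_r(0, \tau^2I)$, for which the marginal of X is $N_r(0, (1+\tau^2)I)$ and the posterior mean is $\frac{\tau^2}{1+\tau^2}x$; the gradient of $\log m$ is approximated by central differences. All names are hypothetical.

```python
import math


def log_marginal(x, tau2):
    """log m(x) when X | theta ~ N_r(theta, I) and theta ~ N_r(0, tau2 * I)."""
    r = len(x)
    v = 1.0 + tau2  # marginal variance of each coordinate
    return -0.5 * r * math.log(2 * math.pi * v) - sum(xi * xi for xi in x) / (2 * v)


def gradient(f, x, h=1e-5):
    """Central-difference approximation to the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad


def bayes_via_marginal(x, tau2):
    """x + grad log m(x), the form of the Bayes estimator in Problem 5.17."""
    g = gradient(lambda z: log_marginal(z, tau2), x)
    return [xi + gi for xi, gi in zip(x, g)]


def posterior_mean(x, tau2):
    """Closed-form posterior mean under the conjugate normal prior."""
    return [tau2 / (1.0 + tau2) * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5]
    print(bayes_via_marginal(x, 3.0))
    print(posterior_mean(x, 3.0))
```

The two printed vectors should agree up to finite-difference error, which is what the identity predicts for this prior.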

5.18 Verify (5.27). 

[Hint: Show that, as a function of $|x|^2$, the only possible interior extremum is a minimum,
so the maximum must occur either at $|x|^2 = 0$ or $|x|^2 = \infty$.]

5.19 The property of superharmonicity, and its relationship to minimaxity, is not restricted
to Bayes estimators. For $X \sim N_r(\theta, I)$, a pseudo-Bayes estimator (so named, and
investigated by Bock 1988) is an estimator of the form

$$x + \nabla\log m(x)$$

where $m(x)$ is not necessarily a marginal density.

(a) Show that the positive-part Stein estimator

$$\delta_a^+(x) = \mu + \left(1 - \frac{a}{|x - \mu|^2}\right)^+(x - \mu)$$

is a pseudo-Bayes estimator with

$$m(x) = \begin{cases} e^{-(1/2)|x-\mu|^2} & \text{if } |x - \mu|^2 < a, \\ (|x - \mu|^2)^{-a/2} & \text{if } |x - \mu|^2 \ge a. \end{cases}$$

(b) Show that, except at the point of discontinuity, if $a \le r - 2$, then $\sum_{i=1}^r \frac{\partial^2}{\partial x_i^2}m(x) \le 0$, so $m(x)$ is superharmonic.

(c) Show how to modify the proof of Corollary 5.11 to accommodate superharmonic
functions $m(x)$ with a finite number of discontinuities of measure zero.

This result is adapted from George (1986a, 1986b), who exploits both pseudo-Bayes
and superharmonicity to establish minimaxity of an interesting class of estimators that
are further investigated in the next problem.

5.20 For $X|\theta \sim N_r(\theta, I)$, George (1986a, 1986b) looked at multiple shrinkage estimators,
those that can shrink to a number of different targets. Suppose that $\theta \sim \pi(\theta) = \sum_{j=1}^k \omega_j\pi_j(\theta)$, where the $\omega_j$ are known positive weights, $\sum\omega_j = 1$.

(a) Show that the Bayes estimator against $\pi(\theta)$, under squared error loss, is given by
$\delta^*(x) = x + \nabla\log m^*(x)$, where $m^*(x) = \sum_{j=1}^k \omega_jm_j(x)$ and

$$m_j(x) = \int_\Omega \frac{1}{(2\pi)^{r/2}}\,e^{-(1/2)|x-\theta|^2}\pi_j(\theta)\,d\theta.$$

(b) Clearly, $\delta^*$ is minimax if $m^*(x)$ is superharmonic. Show that $\delta^*(x)$ is minimax if
either (i) $m_j(x)$ is superharmonic, $j = 1, \ldots, k$, or (ii) $\pi_j(\theta)$ is superharmonic,
$j = 1, \ldots, k$. [Hint: Problem 1.7.16]

(c) The real advantage of $\delta^*$ occurs when the components specify different targets. For
$\rho_j = \omega_jm_j(x)/m^*(x)$, let $\delta^*(x) = \sum_j \rho_j\delta_j^+(x)$, where

$$\delta_j^+(x) = \mu_j + \left(1 - \frac{r - 2}{|x - \mu_j|^2}\right)^+(x - \mu_j)$$

and the $\mu_j$'s are target vectors. Show that $\delta^*(x)$ is minimax. [Hint: Problem 5.19]

[George (1986a, 1986b) investigated many types of multiple targets, including multiple
points, subspaces, and clusters and subvectors. The subvector problem was also considered
by Berger and Dey (1983a, 1983b). Multiple shrinkage estimators were also
investigated by Ki and Tsui (1990) and Withers (1991).]

5.21 Let $X_i$, $Y_j$ be independent $N(\xi_i, 1)$ and $N(\eta_j, 1)$, respectively ($i = 1, \ldots, r$; $j = 1, \ldots, s$).

(a) Find an estimator of $(\xi_1, \ldots, \xi_r; \eta_1, \ldots, \eta_s)$ that would be good near $\xi_1 = \cdots = \xi_r = \xi$, $\eta_1 = \cdots = \eta_s = \eta$, with $\xi$ and $\eta$ unknown, if the variability of the $\xi$'s and
$\eta$'s is about the same.

(b) When the loss function is (4.17), determine the risk function of your estimator.

[Hint: Consider the Bayes situation in which $\xi_i \sim N(\xi, A)$ and $\eta_j \sim N(\eta, A)$. See
Berger 1982b for further development of such estimators.]

5.22 The early proofs of minimaxity of Stein estimators (James and Stein 1961, Baranchik
1970) relied on the representation of a noncentral $\chi^2$-distribution as a Poisson sum of
central $\chi^2$ (TSH2, Problem 6.7). In particular, if $\chi^2_r(\lambda)$ is a noncentral $\chi^2$ random variable
with noncentrality parameter $\lambda$, then

$$E\,h(\chi^2_r(\lambda)) = E[E\,h(\chi^2_{r+2K})\,|\,K]$$

where $K \sim \mathrm{Poisson}(\lambda)$ and $\chi^2_{r+2K}$ is a central $\chi^2$ random variable with $r + 2K$ df. Use
this representation, and the properties of the central $\chi^2$-distribution, to establish the
following identities for $X \sim N_r(\theta, I)$ and $\lambda = |\theta|^2$.

(a) $E_\theta\,\dfrac{X'\theta}{|X|^2} = |\theta|^2\,E\,\dfrac{1}{\chi^2_{r+2}(\lambda)}$.

(b) $(r - 2)\,E\,\dfrac{1}{\chi^2_r(\lambda)} + |\theta|^2\,E\,\dfrac{1}{\chi^2_{r+2}(\lambda)} = 1$.

(c) For $\delta(x) = (1 - c/|x|^2)x$, use the identities (a) and (b) to show that for $L(\theta, \delta) = |\theta - \delta|^2$,

$$R(\theta, \delta) = r + 2c|\theta|^2\,E\,\frac{1}{\chi^2_{r+2}(\lambda)} - 2c + c^2\,E\,\frac{1}{\chi^2_r(\lambda)}$$

$$= r + 2c\left[1 - (r - 2)\,E\,\frac{1}{\chi^2_r(\lambda)}\right] - 2c + c^2\,E\,\frac{1}{\chi^2_r(\lambda)}$$

and, hence, that $\delta(x)$ is minimax if $0 \le c \le 2(r - 2)$.

[See Bock 1975 or Casella 1980 for more identities involving noncentral $\chi^2$ expectations.]

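5.23 Let $\chi^2_r(\lambda)$ be a $\chi^2$ random variable with r degrees of freedom and noncentrality
parameter $\lambda$.

(a) Show that $E\,\dfrac{1}{\chi^2_r(\lambda)} = E\left[E\left(\dfrac{1}{\chi^2_{r+2K}}\,\Big|\,K\right)\right] = E\left[\dfrac{1}{r - 2 + 2K}\right]$, where $K \sim \mathrm{Poisson}(\lambda/2)$.

(b) Establish (5.32).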
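
The Poisson-mixture representation of Problem 5.23(a) makes the inverse moments in Problem 5.22 easy to compute, so identity 5.22(b) and the risk formula of 5.22(c) can be checked numerically. The sketch below is a rough check under stated assumptions: it uses the convention $K \sim \mathrm{Poisson}(\lambda/2)$ with $\lambda = |\theta|^2$, truncates the Poisson series far in the upper tail, and uses hypothetical function names.

```python
import math


def poisson_expect(f, mean):
    """E f(K) for K ~ Poisson(mean), with pmf terms computed in log space."""
    if mean == 0:
        return f(0)
    k_max = int(mean + 12 * math.sqrt(mean) + 40)  # truncation point (assumption)
    total = 0.0
    for k in range(k_max + 1):
        log_pmf = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_pmf) * f(k)
    return total


def inv_chisq_mean(r, lam):
    """E[1 / chi2_r(lam)] for r > 2, using E[1 / chi2_{r+2K}] = 1 / (r - 2 + 2K)."""
    return poisson_expect(lambda k: 1.0 / (r - 2 + 2 * k), lam / 2)


def identity_b(r, lam):
    """Left side of identity 5.22(b) with |theta|^2 = lam; it should equal 1."""
    return (r - 2) * inv_chisq_mean(r, lam) + lam * inv_chisq_mean(r + 2, lam)


def stein_risk(r, lam, c):
    """Risk of (1 - c/|x|^2) x implied by 5.22(c): r - [2c(r-2) - c^2] E[1/chi2_r(lam)]."""
    return r - (2 * c * (r - 2) - c * c) * inv_chisq_mean(r, lam)


if __name__ == "__main__":
    print(identity_b(5, 3.0))
    for lam in (0.0, 1.0, 5.0, 25.0):
        print(lam, stein_risk(5, lam, 3.0))
```

With r = 5 and c = r - 2 = 3, the printed risks start well below r at $\lambda = 0$ and rise toward r as $\lambda$ grows, consistent with minimaxity for $0 \le c \le 2(r-2)$.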


5.24 For the most part, the risk function of a Stein estimator increases as $|\theta|$ moves away
from zero (if zero is the shrinkage target). To guarantee that the risk function is monotone
increasing in $|\theta|$ (that is, that there are no “dips” in the risk as in Berger's 1976a tail
minimax estimators) requires a somewhat stronger assumption on the estimator (Casella
1990). Let $X \sim N_r(\theta, I)$ and $L(\theta, \delta) = |\theta - \delta|^2$, and consider the Stein estimator

$$\delta(x) = \left(1 - c(|x|^2)\,\frac{r - 2}{|x|^2}\right)x.$$

(a) Show that if $0 \le c(\cdot) \le 2$ and $c(\cdot)$ is concave and twice differentiable, then $\delta(x)$ is
minimax. [Hint: Problem 1.7.7.]

(b) Under the conditions in part (a), the risk function of $\delta(x)$ is nondecreasing in $|\theta|$.
[Hint: The conditions on $c(\cdot)$, together with the identity

$$\frac{\partial}{\partial\lambda}E_\lambda[h(\chi^2_p(\lambda))] = E_\lambda\left\{\left[\frac{\partial}{\partial\chi^2_{p+2}(\lambda)}\right]h(\chi^2_{p+2}(\lambda))\right\},$$

where $\chi^2_p(\lambda)$ is a noncentral $\chi^2$ random variable with p degrees of freedom and
noncentrality parameter $\lambda$, can be used to show that $(\partial/\partial|\theta|^2)R(\theta, \delta) > 0$.]

5.25 In the spirit of Stein's “large r and $|\theta|$” argument, Casella and Hwang (1982) investigated
the limiting risk ratio of $\delta^{JS}(x) = (1 - (r - 2)/|x|^2)x$ to that of x. If $X \sim N_r(\theta, I)$
and $L(\theta, \delta) = |\theta - \delta|^2$, they showed

$$\lim_{\substack{r \to \infty \\ |\theta|^2/r \to c}} \frac{R(\theta, \delta^{JS})}{R(\theta, x)} = \frac{c}{c + 1}.$$

To establish this limit we can use the following steps.

(a) Show that $\dfrac{R(\theta, \delta^{JS})}{R(\theta, x)} = 1 - \dfrac{(r - 2)^2}{r}\,E_\theta\dfrac{1}{|X|^2}$.

(b) Show that $\dfrac{1}{r - 2 + |\theta|^2} \le E_\theta\dfrac{1}{|X|^2} \le \dfrac{1}{r - 2}\left(\dfrac{r}{r + |\theta|^2}\right)$.

(c) Show that the upper and lower bounds on the risk ratio both have the same limit.

[Hint: (b) The upper bound is a consequence of Problem 5.22(b). For the lower bound,
show $E_\theta(1/|X|^2) = E(1/(r - 2 + K))$, where $K \sim \mathrm{Poisson}(|\theta|^2)$, and use Jensen's
inequality.]
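
The limit in Problem 5.25 can be explored numerically before it is proved. The sketch below computes $E_\theta(1/|X|^2)$ from a Poisson series, checks it against the two bounds of part (b), and evaluates the risk ratio of part (a) along $|\theta|^2 = cr$. It is a sketch under assumptions: it uses the representation $E_\theta(1/|X|^2) = E[1/(r - 2 + 2K)]$ with $K \sim \mathrm{Poisson}(|\theta|^2/2)$, as in Problem 5.23(a), and all function names are hypothetical.

```python
import math


def poisson_expect(f, mean):
    """E f(K) for K ~ Poisson(mean), with pmf terms computed in log space."""
    if mean == 0:
        return f(0)
    k_max = int(mean + 12 * math.sqrt(mean) + 40)  # truncation point (assumption)
    total = 0.0
    for k in range(k_max + 1):
        log_pmf = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_pmf) * f(k)
    return total


def inverse_moment(r, theta_sq):
    """E_theta(1/|X|^2) for X ~ N_r(theta, I), r > 2, with theta_sq = |theta|^2."""
    return poisson_expect(lambda k: 1.0 / (r - 2 + 2 * k), theta_sq / 2)


def moment_bounds(r, theta_sq):
    """Lower and upper bounds on E_theta(1/|X|^2) from part (b)."""
    lower = 1.0 / (r - 2 + theta_sq)
    upper = r / ((r - 2) * (r + theta_sq))
    return lower, upper


def risk_ratio(r, theta_sq):
    """R(theta, delta_JS) / R(theta, x) from part (a)."""
    return 1.0 - (r - 2) ** 2 / r * inverse_moment(r, theta_sq)


if __name__ == "__main__":
    c = 1.0
    for r in (5, 20, 100, 2000):
        print(r, risk_ratio(r, c * r), c / (c + 1))
```

For fixed c the first printed column of ratios should approach c/(c + 1) as r increases.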


Section 6 

6.1 Referring to Example 6.1, this problem will establish the validity of the expression
(6.2) for the risk of the estimator $\delta^L$ of (6.1), using an argument similar to that in the
proof of Theorem 5.7.

(a) Show that

$$R(\theta, \delta^L) = \sum_i E_\theta[\theta_i - \delta_i^L(X)]^2$$

$$= \sum_i E_\theta\left\{(\theta_i - X_i)^2 + \frac{2c(r - 3)}{S}(\theta_i - X_i)(X_i - \bar{X}) + \frac{[c(r - 3)]^2}{S^2}(X_i - \bar{X})^2\right\},$$

where $S = \sum_j(X_j - \bar{X})^2$.

(b) Use integration by parts to show

$$E_\theta\,\frac{(\theta_i - X_i)(X_i - \bar{X})}{S} = E_\theta\,\frac{-\frac{r-1}{r}S + 2(X_i - \bar{X})^2}{S^2}.$$

[Hint: Write the cross-term as $-E_\theta\,\frac{(X_i - \bar{X})}{S}(X_i - \theta_i)$ and adapt Stein's identity
(Lemma 1.5.15).]

(c) Use the results of parts (a) and (b) to establish (6.2).


6.2 In Example 6.1, show that: 

(a) The estimator $\delta^L$ is minimax if $r \ge 4$ and $c \le 2$.

(b) The risk of $\delta^L$ is infinite if $r \le 3$.

(c) The minimum risk is equal to $3/r$, and is attained at $\theta_1 = \theta_2 = \cdots = \theta_r$.

(d) The estimator $\delta^L$ is dominated in risk by its positive-part version

$$\delta^{L+} = \bar{x}\mathbf{1} + \left(1 - \frac{c(r - 3)}{|x - \bar{x}\mathbf{1}|^2}\right)^+(x - \bar{x}\mathbf{1}).$$


6.3 In Example 6.2: 

(a) Show that $\hat\theta_k(x)$ is the MLE if $\theta \in \mathcal{L}_k$.

(b) Show that $\delta^k(x)$ of (6.8) is minimax under squared error loss.

(c) Verify that $\theta_i$ of the form (6.4) satisfy $T(T'T)^{-1}T'\theta = \theta$ for T of (6.5), and construct
a minimax estimator that shrinks toward this subspace.

6.4 Consider the problem of estimating the mean based on $X \sim N_r(\theta, I)$, where it is
thought that $\theta_i = \sum_{j=1}^s \beta_jt_i^j$, where $(t_1, \ldots, t_r)$ are known, $(\beta_1, \ldots, \beta_s)$ are unknown,
and $r - s > 2$.

(a) Find the MLE of $\theta$, say $\hat\theta_R$, if $\theta$ is assumed to be in the linear subspace

$$\mathcal{L} = \left\{\theta : \sum_{j=1}^s \beta_jt_i^j = \theta_i,\ i = 1, \ldots, r\right\}.$$

(b) Show that $\mathcal{L}$ can be written in the form (6.7), and find K.

(c) Construct a Stein estimator that shrinks toward the MLE of part (a) and prove that
it is minimax.


6.5 For the situation of Example 6.3: 

(a) Show that $\delta^c(x, y)$ is minimax if $0 \le c \le 2$.

(b) Show that if £ = 0, R(0 , <5‘) = 1 - R(0 , <5 comb ) = 1 - and, hence, 

R{0, 8 l ) > R(0.8 comb ). 

(c) For £ =^0, show that R(0 , <5 comb ) = 1 — an( j hence is unbounded as 

l$l oo- 


6.6 The Green and Strawderman (1991) estimator $\delta^c(x, y)$ can be derived as an empirical
Bayes estimator.

(a) For $X|\theta \sim N_r(\theta, \sigma^2I)$, $Y|\theta, \xi \sim N_r(\theta + \xi, \tau^2I)$, $\xi \sim N(0, \gamma^2I)$, and $\theta_i \sim \mathrm{Uniform}(-\infty, \infty)$, with $\sigma^2$ and $\tau^2$ assumed to be known, show how to derive
$\delta^{r-2}(x, y)$ as an empirical Bayes estimator.

(b) Calculate the Bayes estimator, $\delta^\pi$, against squared error loss.

(c) Compare $r(\pi, \delta^\pi)$ and $r(\pi, \delta^{r-2})$.

[Hint: For part (a), Green and Strawderman suggest starting with $\theta \sim N(0, \kappa^2I)$ and letting
$\kappa^2 \to \infty$ to get the uniform prior.]

6.7 In Example 6.4: 


(a) Verify the risk function (6.13). 

(b) Verify that for unknown $\sigma^2$, the risk function of the estimator (6.14) is given by
(6.15).

(c) Show that the minimum risk of the estimator (6.14) is $1 - \frac{\nu}{\nu+2}\,\frac{r-2}{r}$.


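6.8 For the situation of Example 6.4, the analogous modification of the Lindley estimator
(6.1) is

$$\delta^L = \bar{x}\mathbf{1} + \left(1 - \frac{r - 3}{\sum(x_i - \bar{x})^2/\hat\sigma^2}\right)(x - \bar{x}\mathbf{1}),$$

where $\hat\sigma^2 = S^2/(\nu + 2)$ and $S^2/\sigma^2 \sim \chi^2_\nu$, independent of X.

(a) Show that $R(\theta, \delta^L) = 1 - \dfrac{\nu}{\nu+2}\,\dfrac{(r-3)^2}{r}\,E_\theta\dfrac{\sigma^2}{\sum(X_i - \bar{X})^2}$.

(b) Show that both $\delta^L$ and $\delta$ of (6.14) can be improved by using their positive-part
versions.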


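6.9 The major application of Example 6.4 is to the situation

$$Y_{ij} \sim N(\theta_i, \sigma^2), \quad i = 1, \ldots, s,\ j = 1, \ldots, n, \text{ independent},$$

with $\bar{Y}_i = (1/n)\sum_jY_{ij}$ and $\hat\sigma^2 = \sum_{ij}(Y_{ij} - \bar{Y}_i)^2/s(n - 1)$. Show that the estimator

$$\delta_i = \bar{y} + \left(1 - c\,\frac{(s - 3)\hat\sigma^2}{\sum(\bar{y}_i - \bar{y})^2}\right)^+(\bar{y}_i - \bar{y})$$

is a minimax estimator, where $\bar{y} = \sum_{ij}y_{ij}/sn$, as long as $0 \le c \le 2$.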

[The case of unequal sample sizes $n_i$ is not covered by what we have done so far. See
Efron and Morris 1973b, Berger and Bock 1976, and Morris 1983 for approaches to this
problem. The case of totally unknown covariance matrix is considered by Berger et al.
(1977) and Gleser (1979, 1986).]

6.10 The positive-part Lindley estimator of Problem 6.9 has an interesting interpretation
in the one-way analysis of variance, in particular with respect to the usual test performed,
that of $H_0 : \theta_1 = \theta_2 = \cdots = \theta_s$. This hypothesis is tested with the statistic

$$F = \frac{\sum(\bar{y}_i - \bar{y})^2/(s - 1)}{\sum(y_{ij} - \bar{y}_i)^2/s(n - 1)},$$

which, under $H_0$, has an F-distribution with $s - 1$ and $s(n - 1)$ degrees of freedom.

(a) Show that the positive-part Lindley estimator can be written as

$$\delta_i = \bar{y} + \left(1 - c\,\frac{s - 3}{s - 1}\,\frac{1}{F}\right)^+(\bar{y}_i - \bar{y}).$$

(b) The null hypothesis is rejected if F is large. Show that this corresponds to using
the MLE under $H_0$ if F is small, and a Stein estimator if F is large.

(c) The null hypothesis is rejected at level $\alpha$ if $F > F_{s-1,\,s(n-1),\,\alpha}$. For $s = 8$ and $n = 6$:

(i) What is the level of the test that corresponds to choosing $c = 1$, the optimal risk
choice?

(ii) What values of c correspond to choosing $\alpha = .05$ or $\alpha = .01$, typical $\alpha$ levels? Are
the resulting estimators minimax?


6.11 Prove the following extension of Theorem 5.5 to the case of unknown variance, due
to Strawderman (1973).

Theorem 8.4 Let $X \sim N_r(\theta, \sigma^2I)$ and let $S^2/\sigma^2 \sim \chi^2_\nu$, independent of X. The estimator

$$\delta^c(x) = \left(1 - \frac{c(F, S^2)}{F}\,\frac{r - 2}{\nu + 2}\right)x,$$

where $F = \sum x_i^2/S^2$, is a minimax estimator of $\theta$, provided

(i) for each fixed $S^2$, $c(\cdot, S^2)$ is nondecreasing,

(ii) for each fixed F, $c(F, \cdot)$ is nonincreasing,

(iii) $0 \le c(\cdot, \cdot) \le 2$.

[Note that, here, the loss function is taken to be scaled by $\sigma^2$, $L(\theta, \delta) = |\theta - \delta|^2/\sigma^2$;
otherwise the minimax risk is not finite. Strawderman (1973) went on to derive proper
Bayes minimax estimators in this case.]

6.12 For the situation of Example 6.5: 

(a) Show that E a -K=E o (0_. 

(b) If $1/\sigma^2 \sim \chi^2_\nu/\nu$, then $f(|x - \theta|)$ of (6.19) is the multivariate t-distribution, with $\nu$
degrees of freedom and $E_0|X|^{-2} = (r - 2)^{-1}$.

(c) If $1/\sigma^2 \sim Y$, where $\chi^2_\nu/\nu$ is stochastically greater than Y, then $\delta(x)$ of (6.20) is
minimax for this mixture as long as $0 \le c \le 2(r - 2)$.

6.13 Prove Lemma 6.2. 

6.14 For the situation of Example 6.7:

(a) Verify that the estimator (6.25) is minimax if $0 \le c \le 2$. (Theorem 5.5 will apply.)

(b) Referring to (6.27), show that

$$E|\delta^\pi(X) - \delta^c(X)|^2\,I[|X|^2 \ge c(r - 2)(\sigma^2 + \tau^2)] = \frac{\sigma^4}{\sigma^2 + \tau^2}\,E\,\frac{1}{Y}[Y - c(r - 2)]^2\,I[Y \ge c(r - 2)],$$

where $Y \sim \chi^2_r$.

(c) If $\chi^2_\nu$ denotes a chi-squared random variable with $\nu$ degrees of freedom, establish
the identity

$$E\,h(\chi^2_\nu) = \nu\,E\,\frac{h(\chi^2_{\nu+2})}{\chi^2_{\nu+2}}$$

to show that

$$r(\pi, \delta^R) = r(\pi, \delta^\pi) + \frac{1}{r - 2}\,\frac{\sigma^4}{\sigma^2 + \tau^2}\,E[Y - c(r - 2)]^2\,I[Y \ge c(r - 2)],$$

where, now, $Y \sim \chi^2_{r-2}$.

(d) Verify (6.29), hence showing that $r(\pi, \delta^R) \le r(\pi, \delta^c)$.

(e) Show that $E(Y - a)^2I(Y > a)$ is a decreasing function of a, and hence the maximal
Bayes risk improvement, while maintaining minimaxity, is obtained at $c = 2$.
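
The chi-squared identity of part (c) and the monotonicity claim of part (e) can both be checked by numerical integration against the central chi-squared density. The sketch below is an illustration only: the Simpson-rule integrator, the integration cutoff at 200, the degrees of freedom, and the function names are assumptions chosen for the example.

```python
import math


def chisq_pdf(y, nu):
    """Density of a central chi-squared variable with nu degrees of freedom."""
    if y <= 0.0:
        return 0.0
    log_pdf = ((nu / 2 - 1) * math.log(y) - y / 2
               - (nu / 2) * math.log(2) - math.lgamma(nu / 2))
    return math.exp(log_pdf)


def chisq_expect(h, nu, upper=200.0, n=20000):
    """E h(chi2_nu) by the composite Simpson rule on [0, upper] (n even)."""
    step = upper / n
    total = 0.0
    for i in range(n + 1):
        y = i * step
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * h(y) * chisq_pdf(y, nu)
    return total * step / 3


def truncated_square(a):
    """The function y -> (y - a)^2 I(y > a) appearing in parts (b), (c), and (e)."""
    return lambda y: (y - a) ** 2 if y > a else 0.0


if __name__ == "__main__":
    nu = 5
    h = truncated_square(3.0)
    lhs = chisq_expect(h, nu)
    rhs = nu * chisq_expect(lambda y: h(y) / y if y > 0 else 0.0, nu + 2)
    print("identity in (c):", lhs, rhs)
    for a in (0.0, 1.0, 2.0, 4.0, 8.0):
        print(a, chisq_expect(truncated_square(a), nu))
```

The two sides of the identity should agree, and the second loop should print values that decrease in a, starting from $E\,Y^2 = \nu(\nu + 2)$ at a = 0.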


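6.15 For $X_i \sim \mathrm{Poisson}(\lambda_i)$, $i = 1, \ldots, r$, independent, and loss function $L(\lambda, \delta) = \sum(\lambda_i - \delta_i)^2/\lambda_i$:

(a) For what values of a, $\alpha$, and $\beta$ are the estimators of (4.6.29) minimax? Are they
also proper Bayes for these values?

(b) Let $\Lambda = \sum\lambda_i$ and define $\theta_i = \lambda_i/\Lambda$, $i = 1, \ldots, r$. For the prior distribution $\pi(\theta, \Lambda) = m(\Lambda)\,d\Lambda\prod_{i=1}^r d\theta_i$, show that the Bayes estimator is

$$\delta^\pi(x) = \frac{f_\pi(z)}{z + r - 1}\,x,$$

where $z = \sum x_i$ and

$$f_\pi(z) = \frac{\int\Lambda^ze^{-\Lambda}m(\Lambda)\,d\Lambda}{\int\Lambda^{z-1}e^{-\Lambda}m(\Lambda)\,d\Lambda}.$$

(c) Show that the choice $m(\Lambda) = 1$ yields the estimator $\delta(x) = [1 - (r - 1)/(z + r - 1)]x$,
which is minimax.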

(d) Show that the choice $m(\Lambda) = (1 + \Lambda)^{-\beta}$, $1 < \beta < r - 1$, yields an estimator that is
proper Bayes minimax for $r > 2$.

(e) The estimator of part (d) is difficult to evaluate. However, for the prior choice 


m( A) = 


f 


t e 




dt , 1 < fi < r — 1, 


/o (1 + A t)K 
show that the generalized Bayes estimator is

$$\delta^\pi(x) = \frac{z}{z + \beta + r - 1}\,x,$$

and determine conditions for its minimaxity. Show that it is proper Bayes if $\beta > 1$.


6.16 Let $X_i \sim \mathrm{binomial}(p, n_i)$, $i = 1, \ldots, r$, where $n_i$ are unknown and p is known. The
estimation target is $n = (n_1, \ldots, n_r)$ with loss function

$$L(n, \delta) = \sum_{i=1}^r \frac{(n_i - \delta_i)^2}{n_i}.$$

(a) Show that the usual estimator $x/p$ has constant risk $r(1 - p)/p$.

(b) For $r \ge 2$, show that the estimator

$$\delta(x) = \left(1 - \frac{a}{z + r - 1}\right)\frac{x}{p}$$

dominates $x/p$ in risk, where $z = \sum x_i$ and $0 < a < 2(r - 1)(1 - p)$.

[Hint: Use an argument similar to Theorem 6.8, but here $X_i|Z$ is hypergeometric,
with $E(X_i|z) = z\frac{n_i}{N}$ and $\mathrm{var}(X_i|z) = z\frac{n_i}{N}\left(1 - \frac{n_i}{N}\right)\frac{N - z}{N - 1}$, where $N = \sum n_i$.]

(c) Extend the argument from part (b) and find conditions on the function $c(\cdot)$ and
constant b so that

$$\delta(x) = \left(1 - \frac{c(z)}{z + b}\right)\frac{x}{p}$$

dominates $x/p$ in risk.

Domination of the usual estimator of n was looked at by Feldman and Fox (1968),
Johnson (1987), and Casella and Strawderman (1994). The problem of n estimation for
the binomial has some interesting practical applications; see Olkin et al. 1981, Carroll
and Lombard 1985, Casella 1986. Although we have made the unrealistic assumption
that p is known, these results can be adapted to the more practical unknown p case (see
Casella and Strawderman 1994 for details).

6.17 (a) Prove Lemma 6.9. [Hint: Change variables from x to $x - e_i$, and note that $h_i$
must be defined so that $\delta^0(0) = 0$.]

(b) Prove that for $X \sim p_i(x|\theta)$, where $p_i(x|\theta)$ is given by (6.36), $\delta^0(x) = h_i(x - 1)/h_i(x)$ is the UMVU estimator of $\theta$ (Roy and Mitra 1957).

(c) Prove Theorem 6.10. 

6.18 For the situation of Example 6.11: 

(a) Establish that $x + g(x)$, where $g(x)$ is given by (6.42), satisfies $\mathcal{D}(x) \le 0$ for the
loss $L_0(\theta, \delta)$ of (6.38), and hence dominates x in risk.

(b) Derive $\mathcal{D}(x)$ for $X_i \sim \mathrm{Poisson}(\lambda_i)$, independent, and loss $L_{-1}(\lambda, \delta)$ of (6.38). Show
that $x + g(x)$, for $g(x)$ given by (6.43), satisfies $\mathcal{D}(x) \le 0$ and hence is a minimax
estimator of $\lambda$.


6.19 For the situation of Example 6.12: 

(a) Show that the estimator $\delta^0(x) + g(x)$, for $g(x)$ of (6.45), dominates $\delta^0$ in risk under
the loss $L_{-1}(\theta, \delta)$ of (6.38) by establishing that $\mathcal{D}(x) \le 0$.

(b) For the loss $L_0(\theta, \delta)$ of (6.38), show that the estimator $\delta^0(x) + g(x)$, where

c(x)ki(Xi) 

= -ttttt-, 

£>i [*,■(*/)+ (^-jkjiXj)] 

with $k_i(x) = \sum_{\ell=1}^x (t_i - 1 + \ell)/\ell$ and $c(\cdot)$ nondecreasing with $0 \le c(\cdot) \le 2[(\#x_i\text{'s} > 1) - 2]$, has $\mathcal{D}(x) \le 0$ and hence dominates $\delta^0(x)$ in risk.

6.20 In Example 6.12, we saw improved estimators for the success probability of negative
binomial distributions. Similar results hold for estimating the means of the negative
binomial distributions, with some added features of interest. Let $X_1, \ldots, X_r$ be independent
negative binomial random variables with mass function (6.44), and suppose we
want to estimate $\mu = (\mu_1, \ldots, \mu_r)$, where $\mu_i = t_i\theta_i/(1 - \theta_i)$, the mean of the ith distribution,
using the loss $L(\mu, \delta) = \sum(\mu_i - \delta_i)^2/\mu_i$.

(a) Show that the MLE of $\mu$ is X, and the risk of an estimator $\delta(x) = x + g(x)$ can be
written

$$R(\mu, \delta) = R(\mu, X) + E_\mu[\mathcal{D}_1(X) + \mathcal{D}_2(X)]$$

where

$$\mathcal{D}_1(x) = \sum_{i=1}^r\left\{2[g_i(x + e_i) - g_i(x)] + \frac{g_i^2(x + e_i)}{x_i + 1}\right\}$$

and 

V 2 (x) = 1 2 A ' [g,(x + e,) - g,(x)] 

f=t I fi 

g?( x + O) 
ti 

so that a sufficient condition for domination of the MLE is $\mathcal{D}_1(x) + \mathcal{D}_2(x) \le 0$ for
all x. [Use Lemma 6.9 in the form Ef(X)/θ_i = E f(X +

(b) Show that if $X_i$ are $\mathrm{Poisson}(\theta_i)$ (instead of negative binomial), then $\mathcal{D}_2(x) = 0$.
Thus, any estimator that dominates the MLE in the negative binomial case also
dominates the MLE in the Poisson case.

(c) Show that the Clevenson-Zidek estimator 



Mx) 


\ Ex,- + r — 1 


X 


satisfies $\mathcal{D}_1(x) \le 0$ and $\mathcal{D}_2(x) \le 0$ and, hence, dominates the MLE under both the
Poisson and negative binomial model.


This robustness property of Clevenson-Zidek estimators was discovered by Tsui (1984) 
and holds for more general forms of the estimator. Tsui (1984, 1986) also explores other 
estimators of Poisson and negative binomial means and their robustness properties. 


Section 7 

7.1 Establish the claim made in Example 7.2. Let $X_1$ and $X_2$ be independent random
variables, $X_i \sim N(\theta_i, 1)$, and let $L((\theta_1, \theta_2), \delta) = (\theta_1 - \delta)^2$. Show that $\delta = \mathrm{sign}(X_2)$ is an
admissible estimate of $\theta_1$, even though its distribution does not depend on $\theta_1$.

7.2 Efron and Morris (1973a) give the following derivation of the positive-part Stein
estimator as a truncated Bayes estimator. For $X \sim N_r(\theta, \sigma^2I)$, $r \ge 3$, and $\theta \sim N(0, \tau^2I)$,
where $\sigma^2$ is known and $\tau^2$ is unknown, define $t = \sigma^2/(\sigma^2 + \tau^2)$ and put a prior
$h(t)$, $0 < t < 1$, on t.

(a) Show that the Bayes estimator against squared error loss is given by $E(\theta|x) = [1 - E(t|x)]x$, where

$$\pi(t|x) = \frac{t^{r/2}e^{-t|x|^2/2}h(t)}{\int_0^1 t^{r/2}e^{-t|x|^2/2}h(t)\,dt}.$$

(b) For estimators of the form $\delta^\tau(x) = \left(1 - \tau(|x|^2)\frac{r-2}{|x|^2}\right)x$, the estimator that satisfies

(i) $\tau(\cdot)$ is nondecreasing,

(ii) $\tau(\cdot) \le c$,

(iii) $\delta^\tau$ minimizes the Bayes risk against $h(t)$

has $\tau(|x|^2) = \tau^*(|x|^2) = \min\left\{c, \frac{|x|^2}{r-2}E(t|x)\right\}$. (This is a truncated Bayes estimator,
and is minimax if $c \le 2$.)

(c) Show that if $h(t)$ puts all of its mass on $t = 1$, then

$$\tau^*(|x|^2) = \min\left\{c, \frac{|x|^2}{r - 2}\right\}$$

and the resulting truncated Bayes estimator is the positive-part estimator.
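
Part (c) can be illustrated with a few lines of arithmetic. The sketch below codes the truncated Bayes rule of part (b) with the posterior mean $E(t|x)$ passed in as a number, and compares it with the positive-part estimator when $E(t|x) = 1$, which is the case of a prior concentrated at t = 1. The function names and the sample points are illustrative assumptions.

```python
def truncated_bayes(x, c, post_mean_t):
    """(1 - tau*(|x|^2) (r - 2)/|x|^2) x with tau* = min{c, |x|^2/(r - 2) * E(t|x)}."""
    r = len(x)
    norm_sq = sum(xi * xi for xi in x)
    tau_star = min(c, norm_sq / (r - 2) * post_mean_t)
    factor = 1.0 - tau_star * (r - 2) / norm_sq
    return [factor * xi for xi in x]


def positive_part_stein(x, c):
    """(1 - c (r - 2)/|x|^2)^+ x, the positive-part estimator."""
    r = len(x)
    norm_sq = sum(xi * xi for xi in x)
    factor = max(0.0, 1.0 - c * (r - 2) / norm_sq)
    return [factor * xi for xi in x]


if __name__ == "__main__":
    for x in ([0.3, -0.2, 0.1, 0.4], [2.0, -1.0, 3.0, 0.5]):
        # E(t|x) = 1 corresponds to h(t) putting all of its mass on t = 1.
        print(truncated_bayes(x, 1.0, 1.0), positive_part_stein(x, 1.0))
```

For the short vector the truncation is active and both rules return (numerically) zero; for the long vector both rules apply the usual shrinkage factor.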

7.3 Fill in the details of the proof of Lemma 7.5. 

7.4 For the situation of Example 7.8, show that if $\delta_0$ is any estimator of $\theta$, then the class
of all estimators with $\delta(x) < \delta_0(x)$ for some x is complete.

7.5 A decision problem is monotone (as defined by Karlin and Rubin 1956; see also
Brown, Cohen and Strawderman 1976 and Berger 1985, Section 8.4) if the loss function
$L(\theta, \delta)$ is, for each $\theta$, minimized at $\delta = \theta$ and is an increasing function of $|\delta - \theta|$. An
estimator $\delta$ is monotone if it is a nondecreasing function of x.

(a) Show that if $L(\theta, \delta)$ is convex, then the monotone estimators form a complete class.

(b) If $\delta(x)$ is not monotone, show that the monotone estimator $\delta'$ defined implicitly by

$$P_t(\delta'(X) \le t) = P_t(\delta(X) \le t) \quad \text{for every } t$$

satisfies $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta$.

(c) If $X \sim N(\theta, 1)$ and $L(\theta, \delta) = (\theta - \delta)^2$, construct a monotone estimator that
dominates

$$\delta(x) = \begin{cases} -2a - x & \text{if } x < -a, \\ x & \text{if } |x| \le a, \\ 2a - x & \text{if } x > a. \end{cases}$$

7.6 Show that, in the following estimation problems, all risk functions are continuous. 

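(a) Estimate $\theta$ with $L(\theta, \delta(x)) = [\theta - \delta(x)]^2$, $X \sim N(\theta, 1)$.

(b) Estimate $\theta$ with $L(\theta, \delta(x)) = |\theta - \delta(x)|^2$, $X \sim N_r(\theta, I)$.

(c) Estimate $\lambda$ with $L(\lambda, \delta(x)) = \sum_{i=1}^r \lambda_i^{-m}(\lambda_i - \delta_i(x))^2$, $X_i \sim \mathrm{Poisson}(\lambda_i)$, independent.

(d) Estimate $\beta$ with $L(\beta, \delta(x)) = \sum_{i=1}^r \beta_i^{-m}(\beta_i - \delta_i(x))^2$, $X_i \sim \mathrm{Gamma}(\alpha_i, \beta_i)$, independent,
$\alpha_i$ known.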

7.7 Prove the following theorem, which gives sufficient conditions for estimators to have 
continuous risk functions. 

Theorem 8.5 (Ferguson 1967, Theorem 3.7.1) Consider the estimation of $\theta$ with loss
$L(\theta, \delta)$, where $X \sim f(x|\theta)$. Assume

(i) the loss function $L(\theta, \delta)$ is bounded and continuous in $\theta$ uniformly in $\delta$ (so that
$\lim_{\theta\to\theta_0}\sup_\delta|L(\theta, \delta) - L(\theta_0, \delta)| = 0$);

(ii) for any bounded function $\varphi$, $\int\varphi(x)f(x|\theta)\,d\mu(x)$ is continuous in $\theta$.

Then, the risk function $R(\theta, \delta) = E_\theta L(\theta, \delta)$ is continuous in $\theta$.

[Hint: Show that

$$|R(\theta', \delta) - R(\theta, \delta)| \le \int|L(\theta', \delta(x)) - L(\theta, \delta(x))|\,f(x|\theta')\,dx + \int L(\theta, \delta(x))\,|f(x|\theta') - f(x|\theta)|\,dx,$$

and use (i) and (ii) to make the first integral $< \varepsilon/2$, and (i) and (iii) to make the second
integral $< \varepsilon/2$.]

7.8 Referring to Theorem 8.5, show that condition (iii) is satisfied by

(a) the exponential family,

(b) continuous densities in which $\theta$ is a one-dimensional location or scale parameter.

7.9 A family of functions $\mathcal{F}$ is equicontinuous at the point $x_0$ if, given $\varepsilon > 0$, there exists
$\delta$ such that $|f(x) - f(x_0)| < \varepsilon$ for all $|x - x_0| < \delta$ and all $f \in \mathcal{F}$. (The same $\delta$ works
for all f.) The family is equicontinuous if it is equicontinuous at each $x_0$.

Theorem 8.6 (Communicated by L. Gajek) Consider estimation of $\theta$ with loss $L(\theta, \delta)$,
where $X \sim f(x|\theta)$ is continuous in $\theta$ for each x. If

(i) the family $L(\theta, \delta(x))$ is equicontinuous in $\theta$ for each $\delta$,

(ii) for all $\theta, \theta' \in \Omega$,

$$\sup_x \frac{f(x|\theta')}{f(x|\theta)} < \infty,$$

then any finite-valued risk function $R(\theta, \delta) = E_\theta L(\theta, \delta)$ is continuous in $\theta$ and, hence,
the estimators with finite, continuous risks form a complete class.

(a) Prove Theorem 8.6.

(b) Give an example of an equicontinuous family of loss functions. [Hint: Consider
squared error loss with a bounded sample space.]


7.10 Referring to Theorem 7.11, this problem shows that the assumption of continuity
of $f(x|\theta)$ in $\theta$ cannot be relaxed. Consider the density $f(x|\theta)$ that is $N(\theta, 1)$ if $\theta \le 0$
and $N(\theta + 1, 1)$ if $\theta > 0$.

(a) Show that this density has monotone likelihood ratio, but is not continuous in $\theta$.

(b) Show that there exists a bounded continuous loss function $L(\theta - \delta)$ for which the
risk $R(\theta, X)$ is discontinuous.

7.11 For $X \sim f(x|\theta)$ and loss function $L(\theta, \delta) = \sum_{i=1}^r \theta_i^m(\theta_i - \delta_i)^2$, show that condition
(iii) of Theorem 7.11 holds.

7.12 Prove the following (equivalent) version of Blyth's Method (Theorem 7.13). 

Theorem 8.7 Suppose that the parameter space $\Omega \subset \mathbb{R}^r$ is open, and estimators with
continuous risks are a complete class. Let $\delta$ be an estimator with a continuous risk
function, and let $\{\pi_n\}$ be a sequence of (possibly improper) prior measures such that

(i) $r(\pi_n, \delta) < \infty$ for all n,

(ii) for any nonempty open set $\Theta_0 \subset \Omega$,

$$\frac{r(\pi_n, \delta) - r(\pi_n, \delta^{\pi_n})}{\int_{\Theta_0}\pi_n(\theta)\,d\theta} \to 0 \quad \text{as } n \to \infty.$$

Then, $\delta$ is an admissible estimator.

7.13 Fill in some of the gaps in Example 7.14: 

(i) Verify the expressions for the posterior expected losses of $\delta^0$ and $\delta^\pi$ in (7.7).

(ii) Show that the normalized beta priors will not satisfy condition (b) of Theorem 7.13,
and then verify (7.9).

(iii) Show that the marginal distribution of X is given by (7.10).

(iv) Show that

OC OO 1 

T. D(x) < maxja 2 , b 2 } ^ — ->• 0, 

1=1 x=l X 

and hence that $\delta^0$ is admissible.

7.14 Let $X \sim \mathrm{Poisson}(\lambda)$. Use Blyth's method to show that $\delta^0 = X$ is an admissible
estimator of $\lambda$ under the loss function $L(\lambda, \delta) = (\lambda - \delta)^2$ with the following steps:

(a) Show that the unnormalized gamma priors $\pi_n(\lambda) = \lambda^{a-1}e^{-\lambda/n}$ satisfy condition (b)
of Theorem 7.13 by verifying that for any c,

$$\lim_{n\to\infty}\int_0^c \pi_n(\lambda)\,d\lambda = \text{constant}.$$

Also show that the normalized gamma priors will not work.

(b) Show that under the priors $\pi_n(\lambda)$, the Bayes risks of $\delta^0$ and $\delta^{\pi_n}$, the Bayes estimator,
are given by

$$r(\pi_n, \delta^0) = n^a\Gamma(a),$$

$$r(\pi_n, \delta^{\pi_n}) = \frac{n}{n + 1}\,n^a\Gamma(a).$$

(c) The difference in risks is

$$r(\pi_n, \delta^0) - r(\pi_n, \delta^{\pi_n}) = \Gamma(a)n^a\left(1 - \frac{n}{n + 1}\right),$$

which, for fixed $a > 0$, goes to infinity as $n \to \infty$ (Too bad!). However, show that
if we choose $a = a(n) = 1/\sqrt{n}$, then $\Gamma(a)n^a\left(1 - \frac{n}{n+1}\right) \to 0$ as $n \to \infty$. Thus, the
difference in risks goes to zero.

(d) Unfortunately, we must go back and verify condition (b) of Theorem 7.13 for the
sequence of priors with $a = 1/\sqrt{n}$, as part (a) no longer applies. Do this, and
conclude that $\delta^0(x) = x$ is an admissible estimator of $\lambda$.

[Hint: For large n, since $t \le c/n$, use Taylor's theorem to write $e^{-t} = 1 - t + \text{error}$,
where the error can be ignored.]

(Recall that we have previously considered the admissibility of $\delta^0 = X$ in Corollaries
2.18 and 2.20, where we saw that $\delta^0$ is admissible.)
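
The limit claimed in part (c) of Problem 7.14 is easy to tabulate. The sketch below evaluates the risk difference $\Gamma(a)n^a(1 - n/(n+1))$, rewritten as $\Gamma(a)n^a/(n+1)$ to avoid floating-point cancellation, along the sequence $a(n) = 1/\sqrt{n}$, and for comparison with the shape parameter held fixed at a = 2. The function name and the chosen grid of n are illustrative assumptions.

```python
import math


def risk_gap(n, a):
    """Gamma(a) * n**a * (1 - n/(n + 1)), written as Gamma(a) * n**a / (n + 1)."""
    return math.gamma(a) * n ** a / (n + 1)


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
        shrinking = risk_gap(n, 1.0 / math.sqrt(n))  # a(n) = 1/sqrt(n)
        fixed = risk_gap(n, 2.0)                     # a held fixed at 2
        print(n, shrinking, fixed)
```

Along $a(n) = 1/\sqrt{n}$ the gap behaves roughly like $\sqrt{n}/(n+1)$, since $\Gamma(a) \approx 1/a$ for small a and $n^{1/\sqrt{n}} \to 1$, which is why this choice of prior sequence rescues the argument.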

7.15 Use Blyth’s method to establish admissibility in the following situations. 

(a) If $X \sim \mathrm{Gamma}(\alpha, \beta)$, $\alpha$ known, then $x/\alpha$ is an admissible estimator of $\beta$ using
the loss function $L(\beta, \delta) = (\beta - \delta)^2/\beta^2$.

(b) If $X \sim \text{Negative binomial}(k, p)$, then X is an admissible estimator of $\mu = k(1 - p)/p$ using the loss function $L(\mu, \delta) = (\mu - \delta)^2/(\mu + \frac{1}{k}\mu^2)$.

7.16 (i) Show that, in general, if $\delta^\pi$ is the Bayes estimator under squared error loss,
then

$$r(\pi, \delta^g) - r(\pi, \delta^\pi) = E\,|\delta^\pi(X) - \delta^g(X)|^2,$$

thus establishing (7.13).

(ii) Prove (7.15). 

(iii) Use (7.15) to prove the admissibility of X in one dimension.

7.17 The identity (7.14) can be established in another way. For the situation of Example
7.18, show that

$$r(\pi, \delta^g) = r - 2\int [\nabla \log m_\pi(x)]\,[\nabla \log m_g(x)]\, m_\pi(x)\, dx + \int |\nabla \log m_g(x)|^2\, m_\pi(x)\, dx,$$

which implies

$$r(\pi, \delta^\pi) = r - \int |\nabla \log m_\pi(x)|^2\, m_\pi(x)\, dx,$$

and hence deduce (7.14).

7.18 This problem will outline the argument needed to prove Theorem 7.19: 

(a) Show that $\nabla m_g(x) = m_{\nabla g}(x)$, that is,

$$\nabla \int g(\theta)\, e^{-|x-\theta|^2/2}\, d\theta = \int [\nabla g(\theta)]\, e^{-|x-\theta|^2/2}\, d\theta.$$

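(b) Using part (a), show that

$$r(g_n, \delta^g) - r(g_n, \delta^{g_n}) = \int |\nabla \log m_g(x) - \nabla \log m_{g_n}(x)|^2\, m_{g_n}(x)\, dx$$

$$= \int \left| \frac{\nabla m_g(x)}{m_g(x)} - \frac{\nabla m_{g_n}(x)}{m_{g_n}(x)} \right|^2 m_{g_n}(x)\, dx$$

$$\le 2\int \left| \frac{\nabla m_g(x)}{m_g(x)} - \frac{m_{h_n^2 \nabla g}(x)}{m_{g_n}(x)} \right|^2 m_{g_n}(x)\, dx + 2\int \left| \frac{m_{g \nabla h_n^2}(x)}{m_{g_n}(x)} \right|^2 m_{g_n}(x)\, dx$$

$$= B_n + A_n.$$

(c) Show that

$$A_n = 4\int \left| \frac{m_{g h_n \nabla h_n}(x)}{m_{g_n}(x)} \right|^2 m_{g_n}(x)\, dx \le 4\int m_{g(\nabla h_n)^2}(x)\, dx$$

and this last bound $\to 0$ by condition (a).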

(d) Show that the integrand of $B_n \to 0$ as $n \to \infty$, and use condition (b) together with
the dominated convergence theorem to show $B_n \to 0$, proving the theorem.


7.19 Brown and Hwang (1982) actually prove Theorem 7.19 for the case $f(x|\theta) =
e^{\theta' x - \psi(\theta)}$, where we are interested in estimating $\tau(\theta) = E_\theta(X) = \nabla\psi(\theta)$ under the loss
$L(\theta, \delta) = |\tau(\theta) - \delta|^2$. Prove Theorem 7.19 for this case. [The proof is similar to that
outlined in Problem 7.18.]

7.20 For the situation of Example 7.20: 


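(a) Using integration by parts, show that

$$\frac{\partial}{\partial x_i} \int g(\theta)\, e^{-|x-\theta|^2/2}\, d\theta = -\int (x_i - \theta_i)\, g(\theta)\, e^{-|x-\theta|^2/2}\, d\theta = \int \left[ \frac{\partial}{\partial \theta_i} g(\theta) \right] e^{-|x-\theta|^2/2}\, d\theta$$

and hence

$$\frac{\nabla m_g(x)}{m_g(x)} = \frac{\int [\nabla g(\theta)]\, e^{-|x-\theta|^2/2}\, d\theta}{\int g(\theta)\, e^{-|x-\theta|^2/2}\, d\theta}.$$

(b) Use the Laplace approximation (4.6.33) to show that

$$\frac{\int [\nabla g(\theta)]\, e^{-|x-\theta|^2/2}\, d\theta}{\int g(\theta)\, e^{-|x-\theta|^2/2}\, d\theta} \approx \frac{\nabla g(x)}{g(x)},$$

and that

$$\delta^g(x) \approx x + \frac{\nabla g(x)}{g(x)}.$$

(c) If $g(\theta) = 1/|\theta|^k$, show that

$$\delta^g(x) \approx \left(1 - \frac{k}{|x|^2}\right) x.$$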

7.21 In Example 7.20, if $g(\theta) = 1/|\theta|^k$ is a proper prior, then $\delta^g$ is admissible. For what
values of $k$ is this the case?

7.22 Verify that the conditions of Theorem 7.19 are satisfied for $g(\theta) = 1/|\theta|^k$ if (a)
$k > r - 2$ and (b) $k = r - 2$.

7.23 Establish conditions for the admissibility of Strawderman's estimator (Example 5.6) 

(a) using Theorem 7.19, 

(b) using the results of Brown (1971), given in Example 7.21. 

(c) Give conditions under which Strawderman’s estimator is an admissible minimax 
estimator. 


(See Berger 1975, 1976b for generalizations). 

7.24 (a) Verify the Laplace approximation of (7.23). 

(b) Show that, for $h(|x|) = k/|x|^{2\alpha}$, (7.25) can be written as (7.26) and that $\alpha = 1$ is
needed for an estimator to be both admissible and minimax.

7.25 Theorem 7.17 also applies to the Poisson($\lambda$) case, where Johnstone (1984) obtained
the following characterization of admissible estimators for the loss
$L(\lambda, \delta) = \sum_{i=1}^r (\lambda_i - \delta_i)^2/\lambda_i$.

A generalized Bayes estimator of the form $\delta(x) = [1 - h(\sum x_i)]\,x$ is

(i) inadmissible if there exists $\varepsilon > 0$ and $M < \infty$ such that

$$h\left(\sum x_i\right) < \frac{r - 1 - \varepsilon}{\sum x_i} \quad \text{for } \sum x_i > M,$$

(ii) admissible if $h(\sum x_i)(\sum x_i)^{1/2}$ is bounded and there exists $M < \infty$ such that

$$h\left(\sum x_i\right) > \frac{r - 1}{\sum x_i} \quad \text{for } \sum x_i > M.$$

(a) Use Johnstone’s characterization of admissible Poisson estimators (Example 7.22)
to find an admissible Clevenson-Zidek estimator (6.31).

(b) Determine conditions under which the estimator is both admissible and minimax. 

7.26 For the situation of Example 7.23: 

(a) Show that $X/n$ and $(n/(n + 1))(X/n)(1 - X/n)$ are admissible for estimating $p$ and
$p(1 - p)$, respectively.

(b) Show that $\alpha(X/n) + (1 - \alpha)(a/(a + b))$ is an admissible estimator of $p$, where
$\alpha = n/(n + a + b)$. Compare the results here to that of Theorem 2.14 (Karlin’s
theorem). [Note that the results of Diaconis and Ylvisaker (1979) imply that $\pi(\cdot)$ =
uniform are the only priors that give linear Bayes estimators.]

7.27 Fill in the gaps in the proof that estimators $\delta^\pi$ of the form (7.27) are a complete
class.

(a) Show that $\delta^\pi$ is admissible when $r = -1$, $s = n + 1$, and $r + 1 = s$.

(b) For any other estimator $\delta'(x)$ for which $\delta'(x) = h(0)$ for $x < r'$ and $\delta'(x) = h(1)$ for
$x > s'$, show that we must have $r' > r$ and $s' < s$.

(c) Show that $R(p, \delta') < R(p, \delta^\pi)$ for all $p \in [0, 1]$ if and only if $R_{r,s}(p, \delta') <
R_{r,s}(p, \delta^\pi)$ for all $p \in [0, 1]$.

(d) Show that $\int_0^1 R_{r,s}(p, \delta)\,k(p)\,d\pi(p)$ is uniquely minimized by $[\delta^\pi(r + 1), \ldots, \delta^\pi(s - 1)]$,
and hence deduce the admissibility of $\delta^\pi$.

(e) Use Theorem 7.17 to show that any admissible estimator of $h(p)$ is of the form
(7.27), and hence that (7.27) is a minimal complete class.

7.28 For $i = 1, 2, \ldots, k$, let $X_i \sim f_i(x|\theta_i)$ and suppose that $\delta_i^*(x_i)$ is a unique Bayes
estimator of $\theta_i$ under the loss $L_i(\theta_i, \delta)$, where $L_i$ satisfies $L_i(a, a) = 0$ and $L_i(a, a') > 0$,
$a \ne a'$. Suppose that for some $j$, $1 \le j \le k$, there is a value $\theta^*$ such that if $\theta_j = \theta^*$,

(i) $X_j = x^*$ with probability 1,

(ii) $\delta_j^*(x^*) = \theta^*$.

Show that $(\delta_1^*(x_1), \delta_2^*(x_2), \ldots, \delta_k^*(x_k))$ is admissible for $(\theta_1, \theta_2, \ldots, \theta_k)$ under the loss
$\sum_i L_i(\theta_i, \delta_i)$; that is, there is no Stein effect.

7.29 Suppose we observe $X_1, X_2, \ldots$ sequentially, where $X_i \sim f_i(x|\theta_i)$. An estimator
of $\boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_j)$ is called nonanticipative (Gutmann 1982b) if it only depends
on $(X_1, X_2, \ldots, X_j)$. That is, we cannot use information that comes later, with indices
$> j$. If $\delta_i^*(x_i)$ is an admissible estimator of $\theta_i$, show that it cannot be dominated by a
nonanticipative estimator. Thus, this is again a situation in which there is no Stein effect.
[Hint: It is sufficient to consider $j = 2$. An argument similar to that of Example 7.24
will work.]

7.30 For $X \sim N_r(\theta, I)$, consider estimation of $\varphi'\theta$ where $\varphi_{r \times 1}$ is known, using the
estimator $a'X$ with loss function $L(\varphi'\theta, \delta) = (\varphi'\theta - \delta)^2$.

(a) Show that if $a$ lies outside the sphere (7.31), then $a'X$ is inadmissible.

(b) Show that the Bayes estimator of $\varphi'\theta$ against the prior $\theta \sim N(0, V)$ is given by

$$E(\varphi'\theta \mid x) = \varphi' V (I + V)^{-1} x.$$

(c) Find a covariance matrix $V$ such that $E(\varphi'\theta \mid x)$ lies inside the sphere (7.31) [$V$ will
be of rank one, hence of the form $vv'$ for some $r \times 1$ vector $v$].

Parts (a)-(c) show that all linear estimators inside the sphere (7.31) are admissible, and 
those outside are inadmissible. It remains to consider the boundary, which is slightly 
more involved. See Cohen 1966 for details. 

7.31 Brown’s ancillarity paradox. Let $X \sim N_r(\mu, I)$, $r > 2$, and consider the estimation
of $w'\mu = \sum_{i=1}^r w_i \mu_i$, where $w$ is a known vector with $\sum w_i^2 > 0$, using loss function
$L(\mu, d) = (w'\mu - w'd)^2$.

(a) Show that the estimator $w'X$ is minimax and admissible.

(b) Assume now that $w$ is the realized value of a random variable $W$, with distribution
independent of $X$, where $V = E(WW')$ is known. Show that the estimator $w'd^*$,
where 



with 0 < c < 2 (r — 2), dominates w'X in risk. 

[Hint: Establish and use the fact that $E[L(\mu, d)] = E[(d - \mu)'V(d - \mu)]$. This is a
special case of results established by Brown (1990a). It is referred to as a paradox
because the distribution of the ancillary, which should not affect the estimation of $\mu$,
has an enormous effect on the properties of the standard estimator. Brown showed how
these results affect the properties of coefficient estimates in multiple regression when
the assumption of random regressors is made. In that context, the ancillarity paradox
also relates to Shaffer’s (1991) work on best linear unbiased estimation (see Theorem
3.4.14 and Problems 3.4.16-3.4.18).]

7.32 Efron (1990), in a discussion of Brown’s (1990a) ancillarity paradox, proposed an 
alternate version. 

Suppose $X \sim N_r(\theta, I)$, $r > 2$, and with probability $1/r$, independent of $X$, the value
of the random variable $J = j$ is observed, $j = 1, 2, \ldots, r$. The problem is to estimate
$\theta_j$ using the loss function $L(\theta_j, d) = (\theta_j - d)^2$. Show that, conditional on $J = j$, $X_j$ is
a minimax and admissible estimator of $\theta_j$. However, unconditionally, $X_j$ is dominated
by the $j$th coordinate of the James-Stein estimator. This version of the paradox may
be somewhat more transparent. It more clearly shows how the presence of the ancillary
random variable forces the problem to be considered as a multivariate problem, opening
the door for the Stein effect.

9 Notes 

9.1 History 

Deliberate efforts to develop statistical inference and decision making not based on
“inverse probability” (i.e., without assuming prior distributions) were mounted by R.A.
Fisher (for example, 1922, 1930, and 1935; see also Lane 1980), by Neyman and Pearson
(for example, 1933ab), and by Wald (1950). The latter’s general decision theory
introduced, as central notions, the minimax principle and least favorable distributions in
close parallel to the corresponding concepts of the theory of games. Many of the examples
of Section 5.2 were first worked out by Hodges and Lehmann (1950). Admissibility
is another basic concept of Wald’s decision theory. The admissibility proofs in Example
2.8 are due to Blyth (1951) and Hodges and Lehmann (1951). A general necessary and
sufficient condition for admissibility was obtained by Stein (1955). Theorem 2.14 is due
to Karlin (1958), and the surprising inadmissibility results of Section 5.5 had their origin
in Stein's seminal paper (1956b). The relationship between equivariance and the minimax
property was foreshadowed in Wald (1939) and was developed for point estimation
by Peisakoff (1950), Girshick and Savage (1951), Blackwell and Girshick (1954), Kudo
(1955), and Kiefer (1957).

Characterizations of admissible estimators and complete classes have included techniques
such as Blyth’s method and the information inequality. The pathbreaking paper
of Brown (1971) was influential in shaping the understanding of admissibility problems,
and motivated further study of differential inequalities (Brown 1979, 1988) and associated
stochastic processes and Markov chains (Brown 1971, Johnstone 1984, Eaton
1992).

9.2 Synthesis

The strengths of combining the Bayesian and frequentist approach are evident in Problem
1.4. The Bayes approach provides a clear methodology for constructing estimators
(of which REML is a version), while the frequentist approach provides the methodology
for evaluation. There are many other approaches to statistical problems, that is, many
other statistical philosophies. For example, there is the fiducial approach (Fisher 1959),
structural inference (Fraser 1968, 1979), pivotal inference (Barnard 1985), conditional
inference (Fisher 1956, Cox 1958, Buehler 1959, Robinson 1979a, 1979b),
likelihood-based conditional inference (Barndorff-Nielsen 1980, 1983, Barndorff-Nielsen and Cox
1979, 1994), and many more. Moreover, within each philosophy, there are many subdivisions,
for example, robust Bayesian, conditional frequentist, and so on. Examination
of conditional inference, with both synthesis and review in mind, can also be found in
Casella (1987, 1988, 1992b).

An important difference among these different philosophies is the role of conditional 
and unconditional inference, that is, whether the criterion for evaluation of an estimator 
is allowed to depend on the data. 

Example 9.1 Conditional bias. If $X_1, \ldots, X_n$ are distributed iid as
$N(\mu, \sigma^2)$, both unknown, the estimator $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ is an unbiased
estimator of $\sigma^2$; that is, the unconditional expectation satisfies $E_{\sigma^2}[S^2] = \sigma^2$, for all values
of $\sigma^2$. In doing a conditional evaluation, we might ask if there is a set in the sample space
[a reference set or recognizable subset according to Fisher (1959)] on which the conditional
expectation is always biased. Robinson (1979b) showed that there exist constants
$a$ and $\delta > 0$ such that

(9.1) $E_{\sigma^2}[S^2 \mid |\bar{X}|/S < a] > (1 + \delta)\sigma^2$ for all $\mu$, $\sigma^2$,

showing that $S^2$ is conditionally biased. See Problem 1.5 for details.

The importance of a result such as (9.1) is that the experimenter knows whether the
recognizable set $\{(x_1, \ldots, x_n) : |\bar{x}|/s < a\}$ has occurred. If it has, then the claim that
$S^2$ is unbiased may be suspect if the inference is to apply to experiments in which the
recognizable set occurs.
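
A small Monte Carlo experiment can make the conditional bias in (9.1) visible. This is an illustrative sketch only: the function name `conditional_mean_s2`, the sample size n = 5, the cutoff a = 0.3, the choice mu = 0, and the seed are all assumptions made here and are not the constants of Robinson (1979b). It compares the overall average of S^2 with its average over the samples that fall in the recognizable set.

```python
import math
import random

def conditional_mean_s2(n=5, a=0.3, mu=0.0, sigma=1.0, reps=20000, seed=0):
    """Return (mean of S^2 over all samples, mean of S^2 given |xbar|/s < a)."""
    rng = random.Random(seed)
    total_all = 0.0
    total_kept = 0.0
    kept = 0
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # unbiased variance estimate
        total_all += s2
        if abs(xbar) / math.sqrt(s2) < a:                 # sample lies in the recognizable set
            total_kept += s2
            kept += 1
    return total_all / reps, total_kept / kept

unconditional, conditional = conditional_mean_s2()
print(f"E[S^2] ~ {unconditional:.3f},  E[S^2 | |xbar|/s < a] ~ {conditional:.3f}")
```

Under these settings the unconditional average stays near sigma^2 = 1 while the conditional average is visibly larger, in the direction that (9.1) describes.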

The study of conditional properties is actually better suited to examination of confidence 
procedures, which we are not covering here. (However, see TSH2, Chapter 10 for an 
introduction to conditional inference in testing and, hence, in confidence procedures.) 
The variance inequality (9.1) has many interesting consequences in interval estimation 
for normal parameters, both for the mean (Brown 1968, Goutis and Casella 1992) and the 
variance (Stein 1964, Maatta and Casella 1987, Goutis and Casella 1997 and Shorrock
1990). 

9.3 The Hunt-Stein Theorem 

The relationship between equivariance and minimaxity finds an expression in the Hunt- 
Stein theorem. Although these authors did not publish their result, it plays an important 
role in mathematical statistics. 

The work of Hunt and Stein took place in the 1940s, but it was not until the landmark 
paper by Kiefer (1957) that a comprehensive treatment of the topic, and a very general 
version of the theorem, was given. (See also Kiefer 1966 for an expanded discussion.) 
The basis of the theorem is that in invariant statistical problems, if the group satisfies 
certain assumptions, then the existence of a minimax estimator implies the existence 
of an equivariant minimax estimator. Intuitively, we expect such a theorem to exist in
invariant decision problems for which the group is transitive and a right-invariant Haar
measure exists. If the Haar measure were proper, then Theorem 4.1 and Theorem 3.1 
(see also the end of Section 5.4) would apply. The question that the Hunt-Stein theorem 
addresses is whether an improper right-invariant Haar measure can yield a minimax 
estimator. 

The theorem turns out to be not true for all groups, but only for groups possessing 
certain properties. A survey of these properties, and some interrelationships, was made 
by Stone and van Randow (1968). A later paper by Bondar and Milnes (1981) reviews 
and establishes many of the group-theoretic equivalences conjectured by Stone and van 
Randow. From this survey, two equivalent group-theoretic conditions, which we discuss 
informally, emerge as the appropriate conditions on the group. 

A. Amenability. A group is amenable if there exists a right-invariant mean. That is, if
we define the sequence of functionals

$$m_n(f) = \frac{1}{2n}\int_{-n}^{n} f(x)\, dx,$$

where $f \in \mathcal{L}_\infty$, then there exists a functional $m(\cdot)$ such that for any $f_1, \ldots, f_k \in \mathcal{L}_\infty$
and $\varepsilon > 0$ there is an $n_0$ such that

$$|m_n(f_i) - m(f_i)| < \varepsilon \quad \text{for } i = 1, \ldots, k \text{ and all } n > n_0.$$

B. Approximable by proper priors, or the existence of a sequence of proper probability
distributions that converge to the right-invariant Haar measure. [The concept
of approximable by proper priors can be traced to Stein (1965), and was further
developed by Stone (1970) and Heath and Sudderth (1989).]

With these conditions, we can state the theorem.

Theorem 9.2 (Hunt-Stein) If the decision problem is invariant with respect to a group 
G that satisfies condition A (equivalently condition B), then if a minimax estimator 
exists, an equivariant minimax estimator exists. Conversely, if there exists an equivariant 
estimator that is minimax among equivariant estimators, it is minimax overall. 

The proof of this theorem has a history almost as rich as the theorem itself. The original
published proof of Kiefer (1957) was improved upon by use of a fixed-point theorem.
This elegant method is attributed to LeCam and Huber, and is used in the general development
of the Hunt-Stein theorem by LeCam (1986, Section 8.6) and Strasser (1985,
Section 48). An outline of such a proof is given by Kiefer (1966). Brown (1986b) provides
an interesting commentary on Kiefer’s 1957 paper, and also sketches Huber’s
method of proof. Robert (1994a, Section 7.5) gives a particularly readable sketch of the
proof. If the group is finite, then the assumptions of the Hunt-Stein theorem are satisfied,
and a somewhat less complex proof will work. See Berger (1985, Section 6.7) for the
proof for finite groups. In TSH2, Section 9.5, a version of the Hunt-Stein theorem for
testing problems is stated and proved under condition B.

Theorem 9.2 reduces the problem to a property of groups, and to apply the theorem we
need to identify which groups satisfy the A/B conditions. Bondar and Milnes (1981)
provide a nice catalog of groups, and we note that the amenable groups include finite
groups, location/scale groups, the triangular group $T(n)$ of $n \times n$ nonsingular upper
triangular matrices, and many permutation groups. “Large” groups, such as those arising
in nonparametric problems, are often not amenable. A famous group that is not amenable
is the general linear group $GL_n$, $n \ge 2$, of nonsingular $n \times n$ matrices (see Note 3.9.3).
See also Examples 3.7-3.9 for MRE estimators that are not minimax.

9.4 Recentered Sets

The topic of set estimation has not been ignored because of its lack of importance, 
but rather because the subject is so vast that it really needs a separate book-length 
treatment. (TSH2 covers many aspects of standard set estimation theory.) Here, we will 
only comment on some of the developments in set estimators that are centered at Stein 
estimators, so-called recentered sets. 

The remarkable paper of Stein (1962) gave heuristic arguments that showed why recentered
sets of the form

$$C^+ = \{\theta : |\theta - \delta^+(x)| \le c\}$$

would dominate the usual confidence set

$$C^0 = \{\theta : |\theta - x| \le c\}$$

in the sense that $P_\theta(\theta \in C^+(X)) > P_\theta(\theta \in C^0(X))$ for all $\theta$, where $X \sim N_r(\theta, I)$, $r \ge 3$,
and $\delta^+$ is the positive-part Stein estimator. Stein's argument was heuristic, but Brown
(1966) and Joshi (1967) proved the inadmissibility of $C^0$ if $r \ge 3$ (without giving an
explicit dominating procedure). Joshi (1969b) also showed that $C^0$ was admissible if
$r \le 2$.
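
The coverage comparison between the usual and the recentered set can be explored by simulation. The sketch below is illustrative, not the analytic argument of the papers cited: it assumes the positive-part estimator with shrinkage constant r - 2, and the dimension r = 5, the radius c = 3, the parameter values, the seed, and the function names are all hypothetical choices made here. Both sets are evaluated on the same draws, so the two coverage estimates are directly comparable.

```python
import math
import random

def positive_part_js(x):
    """Positive-part James-Stein estimate, using the constant r - 2 (an assumption)."""
    r = len(x)
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (r - 2) / norm2)
    return [shrink * v for v in x]

def coverage(theta, c, reps=20000, seed=1):
    """Monte Carlo coverage of the usual and recentered balls of radius c."""
    rng = random.Random(seed)
    hits_usual = 0
    hits_recentered = 0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        d = positive_part_js(x)
        hits_usual += math.dist(theta, x) <= c        # theta in the ball centered at x
        hits_recentered += math.dist(theta, d) <= c   # theta in the ball centered at the Stein estimate
    return hits_usual / reps, hits_recentered / reps

for theta in ([0.0] * 5, [1.0] * 5, [4.0] * 5):
    usual, recentered = coverage(theta, c=3.0)
    print(f"|theta|={math.hypot(*theta):5.2f}  usual={usual:.3f}  recentered={recentered:.3f}")
```

At theta = 0 the recentered ball covers whenever the usual one does, because shrinking toward the origin can only move the center closer to theta; for theta far from the origin the two coverage estimates are expected to be close.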

Advances in this problem were made by Olshen (1977), Morris (1977, 1983a), Faith 
(1976), and Berger (1980), each demonstrating (but not proving) dominance of C° by 
Stein-like set estimators. Analytic dominance of C° by C + was established by Hwang 
and Casella (1982, 1984) and, in subsequent papers (Casella and Hwang 1983, 1987), 
dominance in both coverage probability and volume was achieved (the latter was only 
demonstrated numerically). 

Many other results followed. Generalizations were given by Ki and Tsui (1985) and 
Shinozaki (1989), and domination results for non-normal distributions by Hwang and 
Chen (1986), Robert and Casella (1990), and Hwang and Ullah (1994). 

All of these improved confidence sets have the property that their coverage probability
is uniformly greater than that of $C^0$, but the infimum of the coverage probability
(the confidence coefficient) is equal to that of $C^0$. As this is the value that is usually
reported, unless there is a great reduction in volume, the practical advantages of such
sets may be minimal. For example, recentered sets such as $C^+$ will present the same
volume and confidence coefficient to an experimenter. Other sets, which attain some
volume reduction but maintain the same confidence coefficient as $C^0$, still are somewhat
“wasteful” because they have coverage probabilities higher than that of $C^0$. However,
this deficiency now seems to be overcome. By adapting results of Brown et al. (1995),
Tseng and Brown (1997) have constructed an improved confidence set, $C^*$, with the
property that $P_\theta(\theta \in C^*(X)) = P_\theta(\theta \in C^0(X))$ for every $\theta$, and $\text{vol}(C^*) < \text{vol}(C^0)$,
achieving a maximal amount of volume reduction while maintaining the same coverage
probability as $C^0$.

9.5 Estimation of the Loss Function 

In the proof of Theorem 5.1, the integration-by-parts technique yielded an unbiased
estimate of the risk, that is, a function $\mathcal{D}(x)$ satisfying $E_\theta \mathcal{D}(X) = E_\theta L(\theta, \delta) = R(\theta, \delta)$.
Of course, we could also consider $\mathcal{D}(x)$ as an estimate of $L(\theta, \delta)$, and ask if $\mathcal{D}(x)$ is a
reasonable estimator using, perhaps, another loss function such as

$$\mathcal{L}(L, \mathcal{D}) = (L(\theta, \delta) - \mathcal{D}(x))^2.$$

If we think of $L(\theta, \delta)$ as a measure of accuracy of $\delta$, then we are looking for good
estimators of this accuracy. Note, however, that this problem is slightly more complex
than the ones we have considered, as the “parameter” $L(\theta, \delta(x))$ is a function of both $\theta$
and $x$.


Loss estimation was first addressed by Rukhin (1988a, 1988b, 1988c) and Johnstone
(1988). Rukhin considered “decision-precision” losses, that combined the estimation
of $\theta$ and $L(\theta, \delta)$ into one loss. Johnstone looked at the multivariate normal problem,
and showed that 1 is an inadmissible estimator of $L(\theta, x) = \frac{1}{r}\sum_{i=1}^r (\theta_i - x_i)^2$ (recall
that $E_\theta[L(\theta, X)] = 1$) and showed that estimates of the form $\mathcal{D}(x) = 1 - (c/r)/|x|^2$
dominate it, where $0 < c < 2(r - 4)$. Note that this implies $r \ge 5$. Further advances in
loss estimation are found in Lu and Berger (1989a, 1989b) and Fourdrinier and Wells
(1995).
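
The improvement over the constant estimate 1 can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than Johnstone's argument: the dimension r = 10, the constant c = 6 (which lies in the stated range 0 < c < 2(r - 4)), the parameter value theta = 0, the seed, and the function name `loss_estimation_risks` are all choices made here. It estimates the expected squared error of both loss estimates on the same draws.

```python
import random

def loss_estimation_risks(theta, c, reps=20000, seed=2):
    """Monte Carlo squared-error risk of two estimates of L(theta, x) = |theta - x|^2 / r.

    Returns (risk of the constant estimate 1, risk of 1 - (c/r)/|x|^2).
    """
    rng = random.Random(seed)
    r = len(theta)
    risk_const = 0.0
    risk_corrected = 0.0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        loss = sum((t - v) ** 2 for t, v in zip(theta, x)) / r   # realized loss of X
        d = 1.0 - (c / r) / sum(v * v for v in x)                # corrected loss estimate
        risk_const += (loss - 1.0) ** 2
        risk_corrected += (loss - d) ** 2
    return risk_const / reps, risk_corrected / reps

const_risk, corrected_risk = loss_estimation_risks([0.0] * 10, c=6.0)
print(f"constant estimate: {const_risk:.4f}   corrected estimate: {corrected_risk:.4f}")
```

At theta = 0 the risk of the constant estimate is the variance of a chi-squared variable divided by r^2, that is 2/r, and the corrected estimate comes in below it in this experiment.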


The loss estimation problem is closely tied to the problem of set estimation, and actually
transforms the set estimation problem into one of point estimation. Suppose $C(x)$ is a
set estimator (or confidence set) for $\theta$, and we measure the worth of $C(x)$ with the loss
function

$$L(\theta, C) = \begin{cases} 1 & \text{if } \theta \in C \\ 0 & \text{if } \theta \notin C. \end{cases}$$

We usually calculate $R(\theta, C) = E_\theta L(\theta, C(X)) = P_\theta(\theta \in C(X))$, the probability of coverage
of $C$. Moreover, $1 - \alpha = \inf_\theta P_\theta(\theta \in C(X))$ is usually reported as our confidence
in $C$. However, it is really of interest to estimate $L(\theta, C)$, the actual coverage. We can
thus ask how well $1 - \alpha$ estimates this quantity, and if there are estimators $\gamma(x)$ (known
as estimators of accuracy) that are better. Using the loss function

$$\mathcal{L}(\theta, \gamma) = (L(\theta, C) - \gamma(x))^2,$$


a number of interesting (and some surprising) results have been obtained. In the multivariate
normal problem improved estimates have been found for the accuracy of the
usual confidence set (Lu and Berger 1989a, 1989b, Robert and Casella 1994) and for the
accuracy of Stein-type confidence sets (George and Casella 1994). However, Hwang
and Brown (1991) have shown that under an additional constraint (that of frequency
validity), the estimator $1 - \alpha$ is an admissible estimator of the accuracy of the usual
confidence set.

Other situations have also been considered. Goutis and Casella (1992) have demonstrated 
that the accuracy statement of Student’s t interval can be uniformly increased, and will 
still dominate $1 - \alpha$ under squared error loss. Hwang et al. (1992) have looked at
accuracy estimation in the context of testing, where complete classes are described and 
the question of the admissibility of the p value is addressed. More recently, Lindsay and 
Yi (1996) have shown that, up to second-order terms, the observed Fisher information is 
the best estimator of the expected Fisher information, which is the variance (or loss) of 
the MLE. One can think of this result as a decision-theoretic formalization of the work 
of Efron and Hinkley (1978). 

This variant of loss estimation as confidence estimation also has roots in the work 
of Kiefer (1976, 1977) who considered an alternate approach to the assignment of 
confidence (see also Brown 1978). 


9.6 Shrinkage and Multicollinearity

In Sections 5 and 6, we have assumed that the covariance is known, and hence, without
loss of generality, that it is the identity. This has led to shrinkage estimators of the form
$\delta_i(x) = (1 - h(x))x_i$, that is, estimators that shrink every coefficient by the same fraction.
If the original variances are unequal, say $X \sim N_r(\theta, \Sigma)$, then it may be more desirable
to shrink some coordinates more than others (Efron and Morris 1973a, 1973b, Morris
1983a). Intuitively, it seems reasonable for the amount of shrinkage to be proportional
to the size of the componentwise variance, that is, the greater the variance the greater
the shrinkage. This would tend to leave alone the coefficients with small variance, and
to shrink the coefficients with higher variance relatively more, bringing the ensemble
information more to bear on these coefficients (and improving the variance of the coefficient
estimates). This strategy reflects the shrinkage pattern of a Bayes estimator with
prior $\theta \sim N(0, \tau^2 I)$.

However, general minimax estimators of Hudson (1974) and Berger (1976a, 1976b) 
(see also Berger 1985, Section 5.4.3, and Chen 1988), on the contrary, shrink the lower 
variance coordinates more than the higher variance coordinates. What happens is that 
the variance/bias trade-off is profitable for coordinates with low variance, but not so for 
coordinates with high variance, where $X_i$ is minimax anyway.

This minimax shrinkage pattern is directly opposite to what is advocated to relieve the
problem of multicollinearity in multiple regression problems. For that problem, work
that started from ridge regression (Hoerl and Kennard 1971a, 1971b) advocated shrinkage
patterns that are similar to those arising from the $N(0, \tau^2 I)$ prior, and were in the
opposite direction of the minimax pattern. There is a large literature on ridge regression,
with much emphasis on applications and data analysis, and less on this dichotomy of
shrinkage patterns. A review of ridge regression is given by Draper and van Nostrand
(1979), and some theoretical properties are investigated by Brown and Zidek (1980),
Casella (1980), and Obenchain (1981); see also Oman 1985 for a discussion of appropriate
prior distributions. Casella (1985b) attempts to resolve the minimax/multicollinear
shrinkage dilemma.
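
The ridge shrinkage pattern described above can be made concrete in the canonical (principal-axis) form of the regression problem, where the ridge estimate multiplies each least squares component by lambda/(lambda + k), with lambda the corresponding eigenvalue of the design cross-product matrix. The following sketch uses hypothetical eigenvalues and a hypothetical ridge constant k; the function name `ridge_shrinkage` is also an illustrative choice.

```python
def ridge_shrinkage(eigenvalues, k, sigma2=1.0):
    """For each eigenvalue of X'X, return (eigenvalue, LS variance, ridge multiplier).

    In canonical coordinates the least squares component has variance sigma2 / lambda,
    and the ridge estimate multiplies that component by lambda / (lambda + k).
    """
    rows = []
    for lam in eigenvalues:
        variance = sigma2 / lam
        factor = lam / (lam + k)
        rows.append((lam, variance, factor))
    return rows

# Hypothetical, ill-conditioned design: eigenvalues spread over three orders of magnitude.
for lam, variance, factor in ridge_shrinkage([100.0, 10.0, 1.0, 0.1], k=1.0):
    print(f"eigenvalue={lam:6.1f}  LS variance={variance:6.2f}  ridge multiplier={factor:.3f}")
```

The components with the largest least squares variance receive the smallest multipliers, that is, the most shrinkage, which matches the pattern attributed above to the normal prior and runs opposite to the minimax pattern.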

9.7 Other Minimax Considerations 
The following provides a guide to some additional minimax literature. 

(i) Bounded mean 

The multiparameter version of Example 2.9 suffers from the additional complication
that many different shapes of the bounding set may be of interest. Shapes that
have been considered are convex sets (DasGupta 1985), spheres and rectangles
(Berry 1990), and hyperrectangles (Donoho et al. 1990). Other versions of this
problem that have been investigated include different loss functions (Bischoff and
Fieger 1992, Eichenauer-Herrmann and Fieger 1992), gamma-minimax estimation
(Vidakovic and DasGupta 1994), other distributions (Eichenauer-Herrmann and
Ickstadt 1992), and other restrictions (Fan and Gijbels 1992, Spruill 1986, Feldman
1991). Some other advances in this problem have come from application of
bounds on the risk function, often derived using the information inequality (Gajek
1987, 1988, Brown and Gajek 1990, Brown and Low 1991, Gajek and Kaluzka
1995). Truncated mean problems also underlie many deeper problems in estimation,
as illustrated by Donoho (1994) and Johnstone (1994).

(ii) Selection of shrinkage target 

In Stein estimation, much has been written on the problem of selecting a shrinkage
target. Berger (1982a, 1982b) shows how to specify elliptical regions in which maximal
risk improvement is obtained, and also shows that desirable Bayesian properties
can be maintained. Oman (1982a, 1982b) and Casella and Hwang (1987) describe
shrinking toward linear subspaces, and Bock (1982) shows how to shrink toward
convex polyhedra. George (1986a, 1986b) constructs estimators that shrink toward
multiple targets, using properties of superharmonic functions to establish minimaxity
of multiple shrinkage estimators.

(iii) A Bayes/minimax compromise

Bayes estimation subject to a bound on the maximum risk was first considered by
Hodges and Lehmann (1952). Although such problems tend to be computationally
difficult, the resulting estimators often show good performance on both frequentist
and Bayesian measures (Berger 1982a, 1982b, 1985, Section 4.7.7, DasGupta and
Bose 1988, Chen 1988, DasGupta and Studden 1989). Some of these properties are
related to those of Stein-type shrinkage estimators (Bickel 1983, 1984, Kempthorne
1988a, 1988b) and some are discussed in Section 5.7.

(iv) Superharmonicity 

The superharmonic condition, although often difficult to verify, has sometimes 
proved helpful in not only establishing minimaxity, but also in understanding what 
types of prior distributions may lead to minimax Bayes estimators. Berger and 
Robert (1990) applied the condition to a family of hierarchical Bayes estimators 
and Haff and Johnson (1986) generalized it to the estimation of means in exponential 
families. More recently, Fourdrinier, Strawderman, and Wells (1998) have shown 
that no superharmonic prior can be proper, and were able to use Corollary 5.11 to 
establish minimaxity of a class of proper Bayes estimators, in particular, the Bayes 
estimator using a Cauchy prior. The review article of Brandwein and Strawderman 
(1990) contains other examples. 

(v) Minimax robustness 

Minimax robustness of Stein estimators, that is, the fact that minimaxity holds 
over a wide range of densities, has been established for many different spherically 
symmetric densities. Strawderman (1974) was the first author to exhibit minimax 
Stein estimators for distributions other than the normal. [The work of Stein (1956b) 
and Brown (1966) had established the inadmissibility of the best invariant estimator, 
but explicit improvements had not been given.] Brandwein and Strawderman (1978, 
1980) have established minimax results for wide classes of mixture distributions, 
under both quadratic and concave loss. Elliptical distributions were considered by
Srivastava and Bilodeau (1989) and Cellier, Fourdrinier, and Robert (1989), where
domination was established for an entire class of distributions.

(vi) Stein estimation 

Other topics of Stein estimation that have received attention include matrix estimation
(Efron and Morris 1976b, Haff 1979, Dey and Srinivasan 1985, Bilodeau
and Srivastava 1988, Carter et al. 1990, Konno 1991), regression problems (Zidek
1978, Copas 1983, Casella 1985b, Jennrich and Oman 1986, Gelfand and Dey
1988, Rukhin 1988c, Oman 1991), nonnormal distributions (Bravo and MacGibbon
1988, Chen and Hwang 1988, Srivastava and Bilodeau 1989, Cellier et al. 1989,
Ralescu et al. 1992), robust estimation (Liang and Waclawiw 1990, Konno 1991),
sequential estimation (Natarajan and Strawderman 1985, Sriram and Bose 1988,
Ghosh et al. 1987), and unknown variances (Berger and Bock 1976, Berger et al.
1977, Gleser 1979, 1986, DasGupta and Rubin 1988, Honda 1991, Tan and Gleser
1992).


9.8 Other Admissibility Considerations 
The following provides a guide to some additional admissibility literature.

(i) Establishing admissibility

Theorems 7.13 and 7.15 (and their generalizations; see, for example, Farrell 1968)
represent the major tools for establishing admissibility. The other admissibility result
we have seen (Karlin's theorem, Theorem 2.14) can actually be derived using
Blyth’s method. (See Zidek 1970, Portnoy 1971, and Brown and Hwang 1982 for
more general Karlin-type theorems, and Berger 1982b for a partial converse.) Combining
these theorems with a thorough investigation of the differential inequality
that results from an integration by parts can also lead to some interesting characterizations
of the behavior of admissible estimators (Portnoy 1975, Berger 1976d,
1976e, 1980a, Brown 1979, 1988). A detailed survey of admissibility is given by
Rukhin (1995).

(ii) Dimension doubling 

Note that for the Poisson case, in contrast to the normal case, the factor $r - 1$ tends to
appear (instead of $r - 2$). This results in the Poisson sample mean being inadmissible
in two dimensions. This occurrence was first explained by Brown (1978, Section
2.3), who noted that the Poisson problem in $k$ dimensions is "qualitatively similar"
to the location problem in $2k$ dimensions (in terms of a differential inequality
derived to establish admissibility). In Johnstone and MacGibbon (1992), this idea
of "dimension doubling" also occurs and provides motivation for the transformed
version of the Poisson problem that they consider.

(iii) Finite populations 

Although we did not cover the topic of admissibility in finite population sampling,
there are interesting connections between admissibility in multinomial,
nonparametric, and finite population problems.

Using results of Meeden and Ghosh (1983) and Cohen and Kuo (1983), Meeden, 
Ghosh and Vardeman (1985) present a theorem that summarizes the admissibility 
connection, relating admissibility in a multinomial problem to admissibility in a 
nonparametric problem. 

Stepwise Bayes arguments, which originated with Johnson (1971) [see also Alam
1979, Hsuan 1979, Brown 1981], are useful tools for establishing admissibility in
these situations.
CHAPTER 6 


Asymptotic Optimality 


1 Performance Evaluations in Large Samples 

The performance of the estimators developed in the preceding chapters—UMVU, 
MRE, Bayes, and minimax—is often difficult to evaluate exactly. This difficulty 
can frequently be overcome by computing power (particularly by simulation). 
Although such an approach works well on a case-by-case basis, it lacks the ability 
to provide an overall picture of performance which is needed, for example, to 
assess robustness and efficiency. We shall consider an alternative approach here: 
to obtain approximations to or limits of performance measures as the sample size 
gets large. Some of the probabilistic tools required for this purpose were treated 
in Section 1.8. 

One basic result of that section concerned the consistency of estimators, that
is, their convergence in probability to the parameters that they are estimating. For
example, if $X_1, X_2, \ldots$ are iid with $E(X_i) = \xi$ and $\mathrm{var}(X_i) = \sigma^2 < \infty$, it was
seen in Example 1.8.3 that the sample mean $\bar X$ is a consistent estimator of $\xi$. More
detailed information about the large-sample behavior of $\bar X$ can be obtained from
the central limit theorem (Theorem 1.8.9), which states that

(1.1)    $\sqrt{n}(\bar X - \xi) \xrightarrow{\mathcal{L}} N(0, \sigma^2).$

The limit theorem (1.1) suggests the approximation

(1.2)    $P\left(\bar X \le \xi + \frac{w\sigma}{\sqrt{n}}\right) \approx \Phi(w),$

where $\Phi$ denotes the standard normal distribution function.

Instead of the probabilities (1.2), one may be interested in the expectation,
variance, and higher moments of $\bar X$ and then find

(1.3)    $E(\bar X) = \xi, \quad E(\bar X - \xi)^2 = \frac{\sigma^2}{n},$

         $E(\bar X - \xi)^3 = \frac{1}{n^2}\mu_3, \quad E(\bar X - \xi)^4 = \frac{1}{n^3}\mu_4 + \frac{3(n-1)}{n^3}\sigma^4,$

where $\mu_k = E(X_1 - \xi)^k$. We shall be concerned with the behavior corresponding
to (1.1) and (1.3) not only of $\bar X$ but also of functions of $\bar X$.
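
The moment identities in (1.3) can be checked exactly in a small case. The sketch below is only an illustration: it assumes Bernoulli($p$) observations (so that $\sigma^2 = pq$, $\mu_3 = pq(q-p)$, $\mu_4 = pq(1-3pq)$), and the function names are hypothetical. It sums over the binomial distribution of the total using exact rational arithmetic and compares the result with the right-hand sides of (1.3).

```python
from fractions import Fraction
from math import comb

def exact_central_moments(n, p):
    """Central moments of orders 2, 3, 4 of the mean of n iid Bernoulli(p)
    variables, by direct summation over the binomial law of the total."""
    q = 1 - p
    pmf = [comb(n, t) * p**t * q**(n - t) for t in range(n + 1)]
    return [sum(w * (Fraction(t, n) - p) ** k for t, w in enumerate(pmf))
            for k in (2, 3, 4)]

def moments_from_1_3(n, p):
    """Right-hand sides of (1.3) with the Bernoulli values of sigma^2, mu_3, mu_4."""
    q = 1 - p
    sigma2 = p * q
    mu3 = p * q * (q - p)
    mu4 = p * q * (1 - 3 * p * q)
    return [sigma2 / n, mu3 / n**2, mu4 / n**3 + 3 * (n - 1) * sigma2**2 / n**3]

n, p = 7, Fraction(1, 3)
# Both lists hold exact fractions, so equality is an exact check.
print(exact_central_moments(n, p) == moments_from_1_3(n, p))
```

Exact fractions are used rather than floats so that agreement is an identity and not a numerical coincidence.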

As we shall see, performance evaluations of statistics $h(\bar X_n)$ based on, respectively,
(1.1) and (1.3), that is, the asymptotic distribution approach and the limiting moment
approach, often agree but not always. It is convenient to have both approaches since they tend
to be applicable in different circumstances. In the present section, we shall begin
with (1.3) and then take up (1.1).

(a) The limiting moment approach 


Theorem 1.1 Let $X_1, \ldots, X_n$ be iid with $E(X_1) = \xi$, $\mathrm{var}(X_1) = \sigma^2$, and finite
fourth moment, and suppose $h$ is a function of a real variable whose first four
derivatives $h'(x)$, $h''(x)$, $h'''(x)$, and $h^{(iv)}(x)$ exist for all $x \in I$, where $I$ is an
interval with $P(X_1 \in I) = 1$. Furthermore, suppose that $|h^{(iv)}(x)| \le M$ for all
$x \in I$, for some $M < \infty$. Then,

(1.4)    $E[h(\bar X)] = h(\xi) + \frac{\sigma^2}{2n}h''(\xi) + R_n,$

and if, in addition, the fourth derivative of $h^2$ is also bounded,

(1.5)    $\mathrm{var}[h(\bar X)] = \frac{\sigma^2}{n}[h'(\xi)]^2 + R_n,$

where the remainder $R_n$ in both cases is $O(1/n^2)$, that is, there exist $n_0$ and $A < \infty$
such that $|R_n(\xi)| < A/n^2$ for $n > n_0$ and all $\xi$.

Proof. The reason for the possibility of such a result is the strong set of assumptions
concerning $h$, which permit an expansion of $h(\bar X_n)$ about $h(\xi)$ with bounded
coefficients. Using the assumptions on the fourth derivative $h^{(iv)}(x)$, we can write

(1.6)    $h(\bar x_n) = h(\xi) + h'(\xi)(\bar x_n - \xi) + \frac{1}{2}h''(\xi)(\bar x_n - \xi)^2 + \frac{1}{6}h'''(\xi)(\bar x_n - \xi)^3 + R(\bar x_n, \xi),$

where

(1.7)    $|R(\bar x_n, \xi)| \le \frac{M(\bar x_n - \xi)^4}{24}.$

Using (1.3) and taking expectations of both sides of (1.6), we find

(1.8)    $E[h(\bar X_n)] = h(\xi) + \frac{1}{2}h''(\xi)\frac{\sigma^2}{n} + O\left(\frac{1}{n^2}\right).$

Here, the term in $h'(\xi)$ is missing since $E(\bar X_n) = \xi$, and the order of the remainder
term follows from (1.3) and (1.7).

To obtain an expansion of $\mathrm{var}[h(\bar X_n)]$, apply (1.8) to $h^2$ in place of $h$, using the
fact that

(1.9)    $[h^2(\xi)]'' = 2\{h(\xi)h''(\xi) + [h'(\xi)]^2\}.$

This yields

(1.10)   $E[h^2(\bar X_n)] = h^2(\xi) + [h(\xi)h''(\xi) + (h'(\xi))^2]\frac{\sigma^2}{n} + O\left(\frac{1}{n^2}\right),$

and it follows from (1.8) that

(1.11)   $[E\,h(\bar X_n)]^2 = h^2(\xi) + h(\xi)h''(\xi)\frac{\sigma^2}{n} + O\left(\frac{1}{n^2}\right).$

Taking the difference proves the validity of (1.5). □

Equation (1.4) suggests the following definition. 

Definition 1.2 A sequence of estimators $\delta_n$ of $h(\xi)$ is unbiased in the limit if

(1.12)   $E[\delta_n] \to h(\xi)$ as $n \to \infty$,

that is, if the bias of $\delta_n$, $E[\delta_n] - h(\xi)$, tends to 0 as $n \to \infty$.

Whenever (1.4) holds, the estimator $h(\bar X)$ of $h(\xi)$ is unbiased in the limit.

Example 1.3 Approximate variance of a binomial UMVU estimator. Consider
the UMVU estimator $T(n - T)/n(n - 1)$ of $pq$ in Example 2.3.1. Note that
$\xi = E(X) = p$ and $\sigma^2 = pq$ and that $\bar X = T/n$, and write the estimator as

         $\delta(\bar X) = \bar X(1 - \bar X)\frac{n}{n - 1}.$

To obtain an approximation of its variance, let us consider first $h(\bar X) = \bar X(1 - \bar X)$.
Then, $h'(p) = 1 - 2p = q - p$ and $\mathrm{var}[h(\bar X)] = (1/n)pq(q - p)^2 + O(1/n^2)$. Also,

         $\left(\frac{n}{n-1}\right)^2 = \frac{1}{(1 - 1/n)^2} = 1 + \frac{2}{n} + O\left(\frac{1}{n^2}\right).$

Thus,

         $\mathrm{var}\,\delta(\bar X) = \left(\frac{n}{n-1}\right)^2 \mathrm{var}\,h(\bar X)$

         $= \left[\frac{pq(q - p)^2}{n} + O\left(\frac{1}{n^2}\right)\right]\left[1 + \frac{2}{n} + O\left(\frac{1}{n^2}\right)\right]$

         $= \frac{pq(q - p)^2}{n} + O\left(\frac{1}{n^2}\right).$

The exact variance of $\delta(\bar X)$ given in Problem 2.3.1(b) shows that the error is
$2p^2q^2/n(n - 1)$ which is, indeed, of the order $1/n^2$. The maximum absolute error
occurs at $p = 1/2$ and is $1/8n(n - 1)$. It is a decreasing function of $n$ which, for
$n = 10$, equals 1/720. On the other hand, the relative error will tend to be large,
unless $p$ is close to 0 or 1 (Problem 1.2; see also Examples 1.8.13 and 1.8.15). ||
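
The size of the approximation error quoted in Example 1.3 can be reproduced by brute force. The following sketch (function names are hypothetical, not from the text) computes the exact variance of $T(n-T)/n(n-1)$ by summing over the binomial distribution with rational arithmetic, and subtracts the first-order approximation $pq(q-p)^2/n$.

```python
from fractions import Fraction
from math import comb

def exact_var_umvu(n, p):
    """Exact variance of T(n - T) / (n(n - 1)) when T ~ Binomial(n, p)."""
    q = 1 - p
    pmf = [comb(n, t) * p**t * q**(n - t) for t in range(n + 1)]
    vals = [Fraction(t * (n - t), n * (n - 1)) for t in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, vals))
    return sum(w * (v - mean) ** 2 for w, v in zip(pmf, vals))

def approx_var(n, p):
    """First-order approximation pq(q - p)^2 / n from (1.5)."""
    q = 1 - p
    return p * q * (q - p) ** 2 / n

n, p = 10, Fraction(1, 2)
error = exact_var_umvu(n, p) - approx_var(n, p)
print(error)  # 1/720, the maximum absolute error quoted for n = 10
```

At $p = 1/2$ the approximation itself is zero, so the whole variance is error; this is the case in which the relative error is largest.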


In this example, the bounded derivative condition of Theorem 1.1 is satisfied
for all polynomials $h$ because $\bar X$ is bounded. On the other hand, the condition fails
when $h$ is a polynomial of degree $k > 4$ and the $X$'s are, for example, normally
distributed. However, (1.5) continues to hold in these circumstances. To see this,
carry out an expansion like (1.6) to the $(k - 1)$st power. The $k$th derivative of $h$ is then
a constant $M$, and instead of (1.7), the remainder will satisfy $R = M(\bar X - \xi)^k/k!$.
The result then follows from the fact that all moments of the $X$'s of order $\le k$ exist
and from Problem 1.1. This argument proves the following variant of Theorem 1.1.

Theorem 1.4 In the situation of Theorem 1.1, formulas (1.4) and (1.5) remain 
valid if for some k > 3 the function h has k derivatives, the kth derivative is 
bounded, and the first k moments of the X’s exist. 
To cover estimators such as

(1.13)   $\delta_n(\bar X) = \Phi\left[\sqrt{\frac{n}{n-1}}\,(u - \bar X)\right]$

of $p$ in Example 2.2.2, in which the function $h$ depends on $n$, a slight generalization
of Theorem 1.1 is required.

Theorem 1.5 Suppose that the assumptions of Theorem 1.1 hold, and that $c_n$ is a
sequence of constants satisfying

(1.14)   $c_n = 1 + \frac{a}{n} + O\left(\frac{1}{n^2}\right).$

Then, the variance of

(1.15)   $\delta_n(\bar X) = h(c_n \bar X)$

satisfies

(1.16)   $\mathrm{var}[\delta_n(\bar X)] = \frac{\sigma^2}{n}[h'(\xi)]^2 + O\left(\frac{1}{n^2}\right).$

The proof is left as an exercise (Problem 1.3).

Example 1.6 Approximate variance of a normal probability estimator. For
the estimator $\delta_n(\bar X)$ given by (1.13), we have

         $c_n = \sqrt{\frac{n}{n-1}} = 1 + \frac{1}{2n} + O\left(\frac{1}{n^2}\right)$

and

         $\delta_n = h(c_n \bar Y) = \Phi(-c_n \bar Y)$ where $Y_i = X_i - u$.

Thus,

(1.17)   $h'(\xi) = -\phi(\xi), \quad h''(\xi) = \xi\phi(\xi),$

and hence from (1.16)

(1.18)   $\mathrm{var}\,\delta_n(\bar X) = \frac{1}{n}\phi^2(u - \xi) + O\left(\frac{1}{n^2}\right).$

Since $\xi$ is unknown, it is of interest to note that to terms of order $1/n$, the maximum
variance is $1/(2\pi n)$.

If the factor $\sqrt{n/(n - 1)}$ is neglected and the maximum likelihood estimator
$\delta(\bar X) = \Phi(u - \bar X)$ is used instead of $\delta_n$, the variance is unchanged (up to the order
$1/n$); however, the estimator is now biased. It follows from (1.8) and (1.17) that

         $E\,\delta(\bar X) = p + \frac{\xi - u}{2n}\phi(u - \xi) + O\left(\frac{1}{n^2}\right),$

so that the bias is of the order $1/n$. The MLE is therefore unbiased in the limit. ||
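
The order-$1/n$ bias of the MLE $\Phi(u - \bar X)$ can be compared with an exact value. The sketch below assumes unit variance, $X_i \sim N(\xi, 1)$, as the absence of $\sigma$ in (1.18) suggests. Under that assumption $E\,\Phi(u - \bar X) = \Phi\big((u - \xi)/\sqrt{1 + 1/n}\big)$, a standard normal-convolution identity that is used here but not stated in the text. The function names are hypothetical.

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def mle_mean_exact(u, xi, n):
    """E Phi(u - Xbar) when Xbar ~ N(xi, 1/n) (unit-variance assumption)."""
    return Phi((u - xi) / sqrt(1.0 + 1.0 / n))

def mle_mean_approx(u, xi, n):
    """Two-term expansion p + (xi - u) phi(u - xi) / (2n) from (1.8) and (1.17)."""
    return Phi(u - xi) + (xi - u) / (2.0 * n) * phi(u - xi)

u, xi = 1.0, 0.0
for n in (10, 100, 1000):
    exact, approx = mle_mean_exact(u, xi, n), mle_mean_approx(u, xi, n)
    # n^2 times the discrepancy should stay bounded if the remainder is O(1/n^2)
    print(n, exact - Phi(u - xi), approx - Phi(u - xi), n * n * (exact - approx))
```

The last column staying roughly constant as $n$ grows is what an $O(1/n^2)$ remainder looks like numerically.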

The approximations of the accuracy of an estimator indicated by the above
theorems may appear somewhat restrictive in that they apply only to functions
of sample means. However, this covers all sufficiently smooth estimators based
on samples from one-parameter exponential families. For on the basis of such a
sample, $\bar T = \sum T(X_i)/n$ [in the notation of (8.1)] is a sufficient statistic, so that
attention can be restricted to estimators that are functions of $\bar T$ (and possibly $n$).
Extensions of the approximations of Theorems 1.1 - 1.5 to functions of higher
sample moments are given by Cramér (1946a, Sections 27.6 and 28.4) and Serfling
(1980, Section 2.2). On the other hand, this type of approximation is not
applicable to optimal estimators for distributions whose support depends on the
unknown parameter, such as the uniform or exponential distributions. Here, the
minimal sufficient statistics and the estimators based on them are governed by
different asymptotic laws with different convergence rates. (See Problems 1.15-1.18
and Examples 7.11-7.14.)

The conditions on the function $h(\cdot)$ of Theorem 1.1 are fairly stringent and do
not apply, for example, to $h(x) = 1/x$ or $\sqrt{x}$ (unless the $X_i$ are bounded away
from zero), and the corresponding fact also limits the applicability of multivariate
versions of these theorems (see Problem 1.27). When the assumptions of the
theorems are not satisfied, the conclusions may or may not hold, depending on the
situation.

Example 1.7 Limiting variance in the exponential distribution. Suppose that
$X_1, \ldots, X_n$ are iid from the exponential distribution with density $(1/\theta)e^{-x/\theta}$, $x > 0$,
$\theta > 0$, so that $EX_i = \theta$ and $\mathrm{var}\,X_i = \theta^2$. The assumptions of Theorem 1.1 do not
hold for $h(x) = \sqrt{x}$, so we cannot use (1.5) to approximate $\mathrm{var}(\sqrt{\bar X})$. However,
an exact calculation shows that (Problem 1.14)

         $\mathrm{var}(\sqrt{\bar X}) = \theta\left[1 - \frac{1}{n}\left(\frac{\Gamma(n + 1/2)}{\Gamma(n)}\right)^2\right]$

and that $\lim_{n\to\infty} n\,\mathrm{var}(\sqrt{\bar X}) = \theta/4 = \theta^2[h'(\theta)]^2$.

Thus, although the assumptions of Theorem 1.1 do not apply, the limit of the
approximation (1.5) is correct. For an example in which the conclusions of the
theorem do not hold, see Problem 1.13(a). ||
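
The limit $n\,\mathrm{var}(\sqrt{\bar X}) \to \theta/4$ can be checked numerically. The sketch below relies on a standard fact not spelled out in the text: $\bar X$ has a gamma distribution with shape $n$ and scale $\theta/n$, so that $E\sqrt{\bar X} = \sqrt{\theta/n}\,\Gamma(n + 1/2)/\Gamma(n)$ and $\mathrm{var}(\sqrt{\bar X}) = E\bar X - (E\sqrt{\bar X})^2$. The function name is hypothetical.

```python
from math import exp, lgamma

def n_times_var_sqrt_mean(n, theta):
    """n * var(sqrt(Xbar)) for n iid exponential observations with mean theta.

    Uses Xbar ~ Gamma(shape=n, scale=theta/n); log-gamma avoids overflow
    of Gamma(n) for large n.
    """
    log_ratio = lgamma(n + 0.5) - lgamma(n)            # log[Gamma(n+1/2)/Gamma(n)]
    mean_sqrt_squared = (theta / n) * exp(2.0 * log_ratio)
    return n * (theta - mean_sqrt_squared)             # var = E(Xbar) - (E sqrt(Xbar))^2

theta = 2.0
for n in (1, 10, 100, 1000):
    print(n, n_times_var_sqrt_mean(n, theta), theta / 4)
```

The printed values approach $\theta/4$ from below, in line with the limit stated in the example.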

Let us next take up the second approach mentioned at the beginning of the 
section. 

(b) The asymptotic distribution approach 

Instead of the behavior of the moments $E[h(\bar X)]$ and $\mathrm{var}[h(\bar X)]$, we now consider
the probabilistic behavior of $h(\bar X)$.

Theorem 1.8 If $X_1, \ldots, X_n$ are iid with expectation $\xi$, and $h$ is any function which
is continuous at $\xi$, then

(1.19)   $h(\bar X) \xrightarrow{P} h(\xi)$ as $n \to \infty$.

Proof. It was seen in Section 1.8 (Example 8.3) that $\bar X \xrightarrow{P} \xi$. The conclusion
(1.19) is then a consequence of the following general result. □

Theorem 1.9 If a sequence of random variables $T_n$ tends to $\xi$ in probability and
if $h$ is continuous at $\xi$, then $h(T_n) \xrightarrow{P} h(\xi)$.

Proof. To show that $P(|h(T_n) - h(\xi)| < a) \to 1$, it is enough to notice that by the
continuity of $h$, the difference $|h(T_n) - h(\xi)|$ will be arbitrarily small if $T_n$ is close
to $\xi$, and that for any $a$, the probability that $|T_n - \xi| < a$ tends to 1 as $n \to \infty$.
We leave a detailed proof to Problem 1.30. □

Consistency can be viewed as a probabilistic analog of unbiasedness in the
limit but, as Theorem 1.8 shows, requires much weaker assumptions on $h$. Unlike
those needed for Theorem 1.1, they are satisfied, for example, when $\xi \ne 0$ and
$h(x) = 1/x$ or $h(x) = 1/\sqrt{x}$.

The assumptions of Theorem 1.1 provide sufficient conditions in order that

(1.20)   $\mathrm{var}[\sqrt{n}\,h(\bar X)] \to \sigma^2[h'(\xi)]^2$ as $n \to \infty$.

On the other hand, it follows from Theorem 8.12 of Section 1.8 that $v^2 = \sigma^2[h'(\xi)]^2$
is also the variance of the limiting distribution $N(0, v^2)$ of $\sqrt{n}[h(\bar X) - h(\xi)]$. This
asymptotic normality holds under the following weak assumptions on $h$.

Theorem 1.10 Let $X_1, \ldots, X_n$ be iid with $E(X_i) = \xi$ and $\mathrm{var}(X_i) = \sigma^2$. Suppose
that

(a) the function $h$ has a derivative $h'$ with $h'(\xi) \ne 0$,

(b) the constants $c_n$ satisfy $c_n = 1 + a/n + O(1/n^2)$.

Then,

(i) $\sqrt{n}[h(c_n \bar X) - h(\xi)]$ has the normal limit distribution with mean zero and
variance $\sigma^2[h'(\xi)]^2$;

(ii) if $h'(\xi) = 0$ but $h''(\xi)$ exists and is not 0, then $n[h(c_n \bar X) - h(\xi)] \xrightarrow{\mathcal{L}} \frac{1}{2}\sigma^2 h''(\xi)\chi_1^2$.

Proof. Immediate consequence of Theorems 1.8.10 - 1.8.14. □

Example 1.11 Continuation of Example 1.6. For $Y_i = X_i - u$, we have $EY_i =
\xi - u$ and $\mathrm{var}\,Y_i = \sigma^2$. The maximum likelihood estimator of $\Phi(u - \xi)$ is given
by $\delta_n' = \Phi(-\bar Y_n)$ and Theorem 1.10 shows

         $\sqrt{n}[\delta_n' - \Phi(u - \xi)] \xrightarrow{\mathcal{L}} N(0, \phi^2(u - \xi)).$

The UMVU estimator is $\delta_n = \Phi(-c_n \bar Y_n)$ as in (1.13), and again by Theorem 1.10,
$\sqrt{n}[\delta_n - \Phi(u - \xi)] \xrightarrow{\mathcal{L}} N(0, \phi^2(u - \xi))$. ||

Example 1.12 Asymptotic distribution of squared mean estimators. Let $X_1, \ldots, X_n$
be iid $N(\theta, \sigma^2)$, and let the estimand be $\theta^2$. Three estimators of $\theta^2$ (Problems
2.2.1 and 2.2.2) are

         $\delta_{1n} = \bar X^2 - \frac{\sigma^2}{n}$   (UMVU when $\sigma$ is known),

         $\delta_{2n} = \bar X^2 - \frac{S^2}{n(n-1)}$   (UMVU when $\sigma$ is unknown),
                  where $S^2 = \sum (X_i - \bar X)^2$,

         $\delta_{3n} = \bar X^2$   (MLE in either case).

For each of these three sequences of estimators $\delta(X)$, let us find the limiting
distribution of $[\delta(X) - \theta^2]$ suitably normalized. Now, $\sqrt{n}(\bar X - \theta) \to N(0, \sigma^2)$ in law
by the central limit theorem. Using $h(u) = u^2$ in Theorem 1.8.10, it follows that

(1.21)   $\sqrt{n}(\bar X^2 - \theta^2) \xrightarrow{\mathcal{L}} N(0, 4\sigma^2\theta^2),$

provided $h'(\theta) = 2\theta \ne 0$.

Next, consider $\delta_{1n}$. Since

         $\sqrt{n}\left(\bar X^2 - \frac{\sigma^2}{n} - \theta^2\right) = \sqrt{n}(\bar X^2 - \theta^2) - \frac{\sigma^2}{\sqrt{n}},$

it follows from Theorem 1.8.9 that

(1.22)   $\sqrt{n}(\delta_{1n} - \theta^2) \xrightarrow{\mathcal{L}} N(0, 4\sigma^2\theta^2).$

Finally, consider

         $\sqrt{n}\left(\bar X^2 - \frac{S^2}{n(n-1)} - \theta^2\right) = \sqrt{n}(\bar X^2 - \theta^2) - \frac{1}{\sqrt{n}}\,\frac{S^2}{n-1}.$

Now, $S^2/(n - 1)$ tends to $\sigma^2$ in probability, so $S^2/\sqrt{n}(n - 1)$ tends to 0 in probability.
Thus,

         $\sqrt{n}(\delta_{2n} - \theta^2) \xrightarrow{\mathcal{L}} N(0, 4\sigma^2\theta^2).$

Hence, when $\theta \ne 0$, all three estimators have the same limit distribution.

There remains the case $\theta = 0$. It is seen from the Taylor expansion [for example,
Equation (1.6)] that if $h'(\theta) = 0$, then

         $\sqrt{n}[h(T_n) - h(\theta)] \to 0$ in probability.

Thus, in particular, in the present situation, when $\theta = 0$, $\sqrt{n}[\delta(X) - \theta^2] \to 0$ in
probability for all three estimators. When $h'(\theta) = 0$, $\sqrt{n}$ is no longer the appropriate
normalizing factor: it tends to infinity too slowly.

Let us therefore apply the second part of Theorem 1.10 to the three estimators
$\delta_{in}$ ($i = 1, 2, 3$) when $\theta = 0$. Since $h''(0) = 2$, it follows that for $\delta_{3n} = \bar X^2$,

         $n(\bar X^2 - \theta^2) = n(\bar X^2 - 0^2) \xrightarrow{\mathcal{L}} \frac{1}{2}\sigma^2(2\chi_1^2) = \sigma^2\chi_1^2.$

Actually, since the distribution of $\sqrt{n}\bar X$ is $N(0, \sigma^2)$ for each $n$, the statistic $n\bar X^2$
is distributed exactly as $\sigma^2\chi_1^2$, so that no asymptotic argument would have been
necessary.

For $\delta_{1n}$, we find

         $n\left(\bar X^2 - \frac{\sigma^2}{n} - \theta^2\right) = n\bar X^2 - \sigma^2,$

and the right-hand side tends in law to $\sigma^2(\chi_1^2 - 1)$. In fact, here too, this is the
exact rather than just a limit distribution. Finally, consider $\delta_{2n}$. Here,

         $n\left(\bar X^2 - \frac{S^2}{n(n-1)} - \theta^2\right) = n\bar X^2 - \frac{S^2}{n-1},$

and since $S^2/(n - 1)$ tends in probability to $\sigma^2$, the limiting distribution is again
$\sigma^2(\chi_1^2 - 1)$.

Although, for $\theta \ne 0$, the three sequences of estimators have the same limit
distribution, this is no longer true when $\theta = 0$. In this case, the limit distribution
of $n(\delta - \theta^2)$ is $\sigma^2(\chi_1^2 - 1)$ for $\delta_{1n}$ and $\delta_{2n}$ but $\sigma^2\chi_1^2$ for the MLE $\delta_{3n}$. These
two distributions differ only in their location. The distribution of $\sigma^2(\chi_1^2 - 1)$ is
centered so that its expectation is zero, while that of $\sigma^2\chi_1^2$ has expectation $\sigma^2$.
So, although $\delta_{1n}$ and $\delta_{2n}$ suffer from the disadvantage of taking on negative values
with probability > 0, asymptotically they are preferable to the MLE $\delta_{3n}$.
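
The location shift between the two limit laws at $\theta = 0$ is easy to see in a small Monte Carlo experiment. The sketch below is illustrative only: the sample size, replication count, and seed are arbitrary choices, and the function name is hypothetical. It draws normal samples with $\theta = 0$ and averages $n(\delta - \theta^2)$ for each of the three estimators.

```python
import random
import statistics

def simulate_normalized_means(n=20, reps=5000, sigma=1.0, seed=0):
    """Monte Carlo means of n * (delta - theta^2) at theta = 0 for the three
    estimators of Example 1.12 (d1, d2: bias-corrected; d3: the MLE)."""
    rng = random.Random(seed)
    draws = {"d1": [], "d2": [], "d3": []}
    for _ in range(reps):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = statistics.fmean(x)
        s2 = sum((xi - xbar) ** 2 for xi in x)           # S^2 = sum (X_i - Xbar)^2
        draws["d1"].append(n * (xbar**2 - sigma**2 / n))
        draws["d2"].append(n * (xbar**2 - s2 / (n * (n - 1))))
        draws["d3"].append(n * xbar**2)
    return {name: statistics.fmean(vals) for name, vals in draws.items()}

means = simulate_normalized_means()
# d1 and d2 should be centered near 0, d3 near sigma^2 = 1
print(means)
```

With these settings the Monte Carlo standard error of each average is roughly 0.02, so the separation between 0 and $\sigma^2$ is clearly visible.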

The estimators $\delta_{1n}$ and $\delta_{2n}$ of Example 1.12 can be thought of as bias-corrected
versions of the MLE $\delta_{3n}$. Typically, the MLE $\hat\theta_n$ has bias of order $1/n$, say

         $b_n(\theta) = \frac{B(\theta)}{n} + O\left(\frac{1}{n^2}\right).$

The order of the bias can be reduced by subtracting from $\hat\theta_n$ an estimator of the
bias based on the MLE. This leads to the bias-corrected ML estimator

(1.23)   $\hat{\hat\theta}_n = \hat\theta_n - \frac{B(\hat\theta_n)}{n},$

whose bias will be of order $1/n^2$. (For an example, see Problem 1.25; see also
Example 7.15.) ||
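
As a concrete instance of the correction (1.23), consider an illustration that is not taken from the text: the MLE $\hat\sigma^2 = \sum (X_i - \bar X)^2/n$ of a normal variance. Its bias is exactly $-\sigma^2/n$, so $B(\sigma^2) = -\sigma^2$ and the corrected estimator is $\hat\sigma^2(1 + 1/n)$. The sketch below (hypothetical function names) evaluates both biases with exact fractions, using only $E\hat\sigma^2 = (n-1)\sigma^2/n$.

```python
from fractions import Fraction

def bias_of_mle(n, sigma2):
    """Bias of the normal-variance MLE: E[S^2 / n] - sigma^2 = -sigma^2 / n."""
    return Fraction(n - 1, n) * sigma2 - sigma2

def bias_of_corrected_mle(n, sigma2):
    """Bias of the MLE after the correction (1.23) with B(sigma^2) = -sigma^2,
    i.e. of sigma_hat^2 * (1 + 1/n)."""
    return Fraction(n - 1, n) * (1 + Fraction(1, n)) * sigma2 - sigma2

for n in (5, 10, 100):
    # the first bias shrinks like 1/n, the second like 1/n^2
    print(n, bias_of_mle(n, 1), bias_of_corrected_mle(n, 1))
```

The corrected estimator is still biased, but its bias is $-\sigma^2/n^2$, one order smaller than that of the MLE.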


To compare these bias-correcting approaches, consider the following example. 

Example 1.13 A family of estimators. The estimators $\delta_{1n}$ and $\delta_{3n}$ for $\theta^2$ of
Example 1.12 are special cases (with $c = 1$ or 0) of the family of estimators

(1.24)   $\delta_n^{(c)} = \bar X^2 - \frac{c\sigma^2}{n}.$

As in Example 1.12, it follows from Theorems 1.8.10 and 1.8.12 that for $\theta \ne 0$,

(1.25)   $\sqrt{n}[\delta_n^{(c)} - \theta^2] \xrightarrow{\mathcal{L}} N(0, 4\sigma^2\theta^2),$

so that the asymptotic variance is $4\sigma^2\theta^2$.

If, instead, we apply Theorem 1.1 with $h(\theta) = \theta^2$, we see that

(1.26)   $E(\bar X^2) = \theta^2 + \frac{\sigma^2}{n} + O\left(\frac{1}{n^2}\right), \quad \mathrm{var}(\bar X^2) = \frac{4\sigma^2\theta^2}{n} + O\left(\frac{1}{n^2}\right).$

Thus, to the first order, the two approaches give the same result.

Since the common value of the asymptotic and limiting variance does not involve
$c$, this first-order approach does not provide a useful comparison of the estimators
(1.24) corresponding to different values of $c$. To obtain such a comparison, we
must take the next-order terms into account. This is easy for approach (a), where
we only need to take the Taylor expansions (1.10) and (1.11) a step further. In fact,
in the present case, it is easy to calculate $\mathrm{var}(\delta_n^{(c)})$ exactly (Problem 1.26).
However, since the estimators $\delta_n^{(c)}$ are biased when $c \ne 1$, they should be compared
not in terms of their variances but in terms of the expected squared error,
which is (Problem 1.26)

(1.27)   $E\left[\bar X^2 - \frac{c\sigma^2}{n} - \theta^2\right]^2 = \frac{4\sigma^2\theta^2}{n} + \frac{(c^2 - 2c + 3)\sigma^4}{n^2}.$

In terms of this measure, the estimator is better the smaller $c^2 - 2c + 3$ is. This
quadratic has its minimum at $c = 1$, and the UMVU estimator $\delta_{1n}$ therefore
minimizes the risk (1.27) among all estimators (1.24).
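
Formula (1.27) can be verified from the raw moments of a normal variable. The sketch below assumes normal data as in Example 1.12, so that $\bar X \sim N(\theta, \tau^2)$ with $\tau^2 = \sigma^2/n$, $E\bar X^2 = \theta^2 + \tau^2$, and $E\bar X^4 = \theta^4 + 6\theta^2\tau^2 + 3\tau^4$. The function names and the grid of $c$ values are hypothetical choices for illustration.

```python
from fractions import Fraction

def risk_from_moments(c, theta, sigma2, n):
    """E[(Xbar^2 - c*sigma^2/n - theta^2)^2] expanded via normal raw moments."""
    tau2 = sigma2 / n
    m2 = theta**2 + tau2                                   # E Xbar^2
    m4 = theta**4 + 6 * theta**2 * tau2 + 3 * tau2**2      # E Xbar^4
    k = c * tau2 + theta**2                                # the constant subtracted
    return m4 - 2 * k * m2 + k**2

def risk_1_27(c, theta, sigma2, n):
    """Right-hand side of (1.27)."""
    return 4 * sigma2 * theta**2 / n + (c**2 - 2 * c + 3) * sigma2**2 / n**2

theta, sigma2, n = Fraction(3, 2), Fraction(2), 10
grid = [Fraction(j, 4) for j in range(-8, 13)]             # c from -2 to 3
print(all(risk_from_moments(c, theta, sigma2, n) == risk_1_27(c, theta, sigma2, n)
          for c in grid))
print(min(grid, key=lambda c: risk_1_27(c, theta, sigma2, n)))   # the minimizing c
```

Since $c^2 - 2c + 3 = (c - 1)^2 + 2$, the minimizer on the grid is $c = 1$, whatever values of $\theta$, $\sigma^2$, and $n$ are used.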

Equality of the asymptotic variance and the limit of the variance does not always
hold (Problem 1.38). However, what can be stated quite generally is that
the appropriately normalized limit of the variance is greater than or equal to the
asymptotic variance. To see this, let us state the following lemma.

Lemma 1.14 Let $Y_n$, $n = 1, 2, \ldots$, be a sequence of random variables such that
$Y_n \xrightarrow{\mathcal{L}} Y$, where $E(Y) = 0$ and $\mathrm{var}(Y) = E(Y^2) = v^2 < \infty$. For a constant $A$,
define $Y_{nA} = Y_n I(|Y_n| \le A) + A\,I(|Y_n| > A)$, the random variable $Y_n$ truncated at
$A$. Then,

(a) $\lim_{A\to\infty}\lim_{n\to\infty} E(Y_{nA}^2) = \lim_{A\to\infty}\lim_{n\to\infty} E[\min(Y_n^2, A^2)] = v^2$,

(b) if $EY_n^2 \to w^2$, then $w^2 \ge v^2$.

Proof. (a) By Theorem 1.8.8,

         $\lim_{n\to\infty} E(Y_{nA}^2) = E[Y^2 I(|Y| \le A)] + A^2 P(|Y| > A),$

and as $A \to \infty$, the right side tends to $v^2$.

(b) It follows from Problem 1.39 that

(1.28)   $\lim_{A\to\infty}\lim_{n\to\infty} E(Y_{nA}^2) \le \lim_{n\to\infty}\lim_{A\to\infty} E(Y_{nA}^2),$

provided the indicated limit exists. Now, $\lim_{A\to\infty} E(Y_{nA}^2) = E(Y_n^2)$, so that the
right side of (1.28) is $w^2$, while the left side is $v^2$ by part (a). □

Suppose now that $T_n$ is a sequence of statistics for which $Y_n = k_n[T_n - E(T_n)]$
tends in law to a random variable $Y$ with zero expectation. Then, the asymptotic
variance $v^2 = \mathrm{var}(Y)$ and the limit of the variances $w^2 = \lim E(Y_n^2)$, if it exists,
satisfy $v^2 \le w^2$ as was claimed. (Note that $w^2$ need not be finite.) Conditions for
$v^2$ and $w^2$ to coincide are given by Chernoff (1956). For the special case that $T_n$ is a
function of a sample mean of iid variables, the two coincide under the assumptions
of Theorems 1.1 and 1.8.12. ||


2 Asymptotic Efficiency 

The large-sample approximations of the preceding section not only provide a
convenient method for assessing the performance of an estimator and for comparing
different estimators, they also permit a new approach to optimality that is less
restrictive than the theories of unbiased and equivariant estimation developed in
Chapters 2 and 3.

It was seen that estimators¹ of interest typically are consistent as the sample
sizes tend to infinity and, suitably normalized, are asymptotically normally
distributed about the estimand with a variance $v(\theta)$ (the asymptotic variance), which
provides a reasonable measure of the accuracy of the estimator sequence. (In this
connection, see Problem 2.1.) Within this class of consistent asymptotically normal
estimators, it turns out that under additional restrictions, there exist estimators
that uniformly minimize $v(\theta)$. The remainder of this chapter is mainly concerned
with the development of fairly explicit methods of obtaining such asymptotically
efficient estimators.

Before embarking on this program, it may be helpful to note an important difference
between the present large-sample approach and the small-sample results
here and elsewhere. Both UMVU and MRE estimators tend to be unique (Theorem
1.7.10) and so are at least some of the minimax estimators derived in Chapter 5.
On the other hand, it is in the nature of asymptotically optimal solutions not to be
unique, since asymptotic results refer to the limiting behavior of sequences, and
the same limit is shared by many different sequences. More specifically, if

         $\sqrt{n}[\delta_n - g(\theta)] \xrightarrow{\mathcal{L}} N(0, v)$

and $\{\delta_n\}$ is asymptotically optimal in the sense of minimizing $v$, then $\{\delta_n + R_n\}$ is
also optimal, provided

         $\sqrt{n}\,R_n \to 0$ in probability.

As we shall see later, asymptotically equivalent optimal estimators can be obtained
from quite different starting points.

The goal of minimizing the asymptotic variance is reasonable only if the estimators
under consideration have the same asymptotic expectation. In particular, we
shall be concerned with estimators whose asymptotic expectation is the quantity
being estimated.

Definition 2.1 If $k_n[\delta_n - g(\theta)] \xrightarrow{\mathcal{L}} H$ for some sequence $k_n$, the estimator $\delta_n$ of
$g(\theta)$ is asymptotically unbiased if the expectation of $H$ is zero.

Note that the definition of asymptotic unbiasedness is analogous to that of
asymptotic variance. Unlike Definition 1.2, it is concerned with properties of the
limiting distribution rather than limiting properties of the distribution of the
estimator sequence. To see that Definition 2.1 is independent of the normalizing
constant, see Problem 2.2.

Under the conditions of Theorem 1.8.12, the estimator $h(T_n)$ is asymptotically
unbiased for the parameter $h(\theta)$. The estimator of Theorem 1.1 is unbiased in the
limit.

Example 2.2 Large-sample behavior of squared mean estimators. In Example
1.12, all three estimators of $\theta^2$ are asymptotically unbiased and unbiased in the
limit. We note that these results continue to hold if the assumption of normality is
replaced by that of finite variance. ||

¹ We shall frequently use estimator instead of the more accurate but cumbersome term estimator
sequence.

In Example 1.12, the MLE was found to be asymptotically unbiased for $\theta \ne 0$ but
asymptotically biased when the range of the distribution depends on the parameter.

Example 2.3 Asymptotically biased estimator. If $X_1, \ldots, X_n$ are iid as $U(0, \theta)$,
the MLE of $\theta$ is $X_{(n)}$ and satisfies (Problem 2.6)

         $n(\theta - X_{(n)}) \xrightarrow{\mathcal{L}} E(0, \theta),$

and, hence, is asymptotically biased, but it is unbiased in the limit.

A similar situation occurs in sampling from an exponential $E(a, b)$ distribution.
See Examples 7.11 and 7.12 for further details. ||
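
Both claims of Example 2.3 can be made concrete with closed forms that follow from $P(X_{(n)} \le x) = (x/\theta)^n$. The sketch below uses hypothetical function names and reads $E(0, \theta)$ as the exponential distribution with mean $\theta$. It compares the exact survival function of $n(\theta - X_{(n)})$ with its exponential limit, and evaluates the exact bias $-\theta/(n + 1)$ of the MLE.

```python
from math import exp

def survival_exact(t, n, theta):
    """P(n(theta - X_(n)) > t) = (1 - t/(n*theta))^n for 0 <= t <= n*theta."""
    return max(0.0, 1.0 - t / (n * theta)) ** n

def survival_limit(t, theta):
    """Survival function exp(-t/theta) of the exponential limit with mean theta."""
    return exp(-t / theta)

def bias_of_mle(n, theta):
    """E X_(n) - theta, using E X_(n) = n*theta/(n + 1)."""
    return n * theta / (n + 1) - theta

theta, t = 2.0, 1.0
for n in (10, 1000, 10**6):
    print(n, survival_exact(t, n, theta), survival_limit(t, theta), bias_of_mle(n, theta))
```

The limit law has mean $\theta \ne 0$, which is the asymptotic bias, while the ordinary bias $-\theta/(n+1)$ tends to zero, which is unbiasedness in the limit.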


The asymptotic analog of a UMVU estimator is an asymptotically unbiased
estimator with minimum asymptotic variance. In the theory of such estimators,
an important role is played by an asymptotic analog of the information inequality
(2.5.31). If $X_1, \ldots, X_n$ are iid according to a density $f_\theta(x)$ (with respect to $\mu$)
satisfying suitable regularity conditions, this inequality states that the variance of
any unbiased estimator $\delta$ of $g(\theta)$ satisfies

(2.1)    $\mathrm{var}_\theta(\delta) \ge \frac{[g'(\theta)]^2}{nI(\theta)},$

where $I(\theta)$ is the amount of information in a single observation defined by (2.5.10).
Suppose now that $\delta_n = \delta_n(X_1, \ldots, X_n)$ is asymptotically normal, say that

(2.2)    $\sqrt{n}[\delta_n - g(\theta)] \xrightarrow{\mathcal{L}} N[0, v(\theta)], \quad v(\theta) > 0.$

Then, it turns out that under some additional restrictions, one also has

(2.3)    $v(\theta) \ge \frac{[g'(\theta)]^2}{I(\theta)}.$

However, although the lower bound (2.1) is attained only in exceptional
circumstances (Section 2.5), there exist sequences $\{\delta_n\}$ that satisfy (2.2) with $v(\theta)$ equal
to the lower bound (2.3) subject only to quite general regularity conditions.


Definition 2.4 A sequence $\{\delta_n\} = \{\delta_n(X_1, \ldots, X_n)\}$, satisfying (2.2) with

(2.4)    $v(\theta) = \frac{[g'(\theta)]^2}{I(\theta)}$

is said to be asymptotically efficient.


At first glance, (2.3) might be thought to be a consequence of (2.1). Two differences
between the inequalities (2.1) and (2.3) should be noted, however.

(i) The estimator $\delta$ in (2.1) is assumed to be unbiased, whereas (2.2) only implies
asymptotic unbiasedness and consistency of $\{\delta_n\}$. It does not imply that $\delta_n$ is
unbiased or even that its bias tends to zero (Problem 2.11).

(ii) The quantity $v(\theta)$ in (2.3) is an asymptotic variance whereas (2.1) refers to
the actual variance of $\delta$. It follows from Lemma 1.14 that

(2.5)    $v(\theta) \le \liminf\,[n\,\mathrm{var}_\theta\,\delta_n]$

but equality need not hold. Thus, (2.3) is a consequence of (2.1), provided

(2.6)    $\mathrm{var}\{\sqrt{n}[\delta_n - g(\theta)]\} \to v(\theta)$

and if $\delta_n$ is unbiased, but not necessarily if these requirements do not hold.


For a long time, (2.3) was nevertheless believed to be valid subject only to
regularity conditions on the densities $f_\theta$. This belief was exploded by the example
(due to Hodges; see Le Cam 1953) given below. Before stating the example, note
that in discussing the inequality (2.3) under assumption (2.2), if $\theta$ is real-valued
and $g(\theta)$ is differentiable, it is enough to consider the case $g(\theta) = \theta$, for which
(2.3) reduces to

(2.7)    $v(\theta) \ge \frac{1}{I(\theta)}.$

For if

(2.8)    $\sqrt{n}(\delta_n - \theta) \xrightarrow{\mathcal{L}} N[0, v(\theta)]$

and if $g$ has derivative $g'$, it was seen in Theorem 1.8.12 that

(2.9)    $\sqrt{n}[g(\delta_n) - g(\theta)] \xrightarrow{\mathcal{L}} N[0, v(\theta)[g'(\theta)]^2].$

After the obvious change of notation, this implies (2.3).


Example 2.5 Superefficient estimator. Let $X_1, \ldots, X_n$ be iid according to the
normal distribution $N(\theta, 1)$ and let the estimand be $\theta$. It was seen in Table 2.5.1
that in this case, $I(\theta) = 1$ so that (2.7) reduces to $v(\theta) \ge 1$. On the other hand,
consider the sequence of estimators

         $\delta_n = \begin{cases} \bar X & \text{if } |\bar X| \ge 1/n^{1/4} \\ a\bar X & \text{if } |\bar X| < 1/n^{1/4}. \end{cases}$

Then (Problem 2.8),

         $\sqrt{n}(\delta_n - \theta) \xrightarrow{\mathcal{L}} N[0, v(\theta)],$

where $v(\theta) = 1$ when $\theta \ne 0$ and $v(\theta) = a^2$ when $\theta = 0$. If $a < 1$, inequality (2.3)
is therefore violated at $\theta = 0$. ||


This phenomenon is quite general (Problems 2.4 - 2.5). There will typically exist
estimators satisfying (2.8) but with $v(\theta)$ violating (2.7) for at least some values of
$\theta$, called points of superefficiency. However, (2.7) is almost true, for it was shown
by Le Cam (1953) that for any sequence $\delta_n$ satisfying (2.8), the set $S$ of points of
superefficiency has Lebesgue measure zero. The following version of this result,
which we shall not prove, is due to Bahadur (1964). The assumptions are somewhat
stronger but similar to those of Theorem 2.5.15.

Remark on notation. Recall that we are using $X_i$ and $X$, and $x_i$ and $x$, for
real-valued random variables and the values they take on, respectively, and $\mathbf{X}$ and $\mathbf{x}$ for
the vectors $(X_1, \ldots, X_n)$ and $(x_1, \ldots, x_n)$, respectively.

Theorem 2.6 Let $X_1, \ldots, X_n$ be iid, each with density $f(x|\theta)$ with respect to a
$\sigma$-finite measure $\mu$, where $\theta$ is real-valued, and suppose the following regularity
conditions hold.

(a) The parameter space $\Omega$ is an open interval (not necessarily finite).

(b) The distributions $P_\theta$ of the $X_i$ have common support, so that the set $A = \{x :
f(x|\theta) > 0\}$ is independent of $\theta$.

(c) For every $x \in A$, the density $f(x|\theta)$ is twice differentiable with respect to $\theta$,
and the second derivative is continuous in $\theta$.

(d) The integral $\int f(x|\theta)\,d\mu(x)$ can be twice differentiated under the integral
sign.

(e) The Fisher information $I(\theta)$ defined by (2.5.10) satisfies $0 < I(\theta) < \infty$.

(f) For any given $\theta_0 \in \Omega$, there exists a positive number $c$ and a function $M(x)$
(both of which may depend on $\theta_0$) such that

         $|\partial^2 \log f(x|\theta)/\partial\theta^2| \le M(x)$ for all $x \in A$, $\theta_0 - c < \theta < \theta_0 + c$

and

         $E_{\theta_0}[M(X)] < \infty.$

Under these assumptions, if $\delta_n = \delta_n(X_1, \ldots, X_n)$ is any estimator satisfying
(2.8), then $v(\theta)$ satisfies (2.7) except on a set of Lebesgue measure zero.

Note that by Lemma 2.5.3, condition (d) ensures that for all $\theta \in \Omega$

(g) $E[\partial \log f(X|\theta)/\partial\theta] = 0$

and

(h) $E[-\partial^2 \log f(X|\theta)/\partial\theta^2] = E[\partial \log f(X|\theta)/\partial\theta]^2 = I(\theta)$.

Condition (d) can be replaced by conditions (g) and (h) in the statement of the
theorem.

The example makes it clear that no regularity conditions on the densities $f(x|\theta)$
can prevent estimators from violating (2.7). This possibility can be avoided only by
placing restrictions on the sequence of estimators also. In view of the information
inequality (2.5.31), an obvious sufficient condition is (2.6) [with $g(\theta) = \theta$] together
with

(2.10)   $b_n'(\theta) \to 0,$

where $b_n(\theta) = E_\theta(\delta_n) - \theta$ is the bias of $\delta_n$.

If $I(\theta)$ is continuous, as will typically be the case, a more appealing assumption
is perhaps that $v(\theta)$ also be continuous. Then, (2.7) clearly cannot be violated at
any point since, otherwise, it would be violated in an interval around this point in
contradiction to Theorem 2.6. As an alternative, which under mild assumptions
on $f$ implies continuity of $v(\theta)$, Rao (1963) and Wolfowitz (1965) require the
convergence in (2.2) to be uniform in $\theta$. By working with coverage probabilities
rather than asymptotic variance, the latter author also removes the unpleasant
assumption that the limit distribution in (2.2) must be normal. An analogous result
is proved by Pfanzagl (1970), who requires the estimators to be asymptotically
median unbiased.

The search for restrictions on the sequence $\{\delta_n\}$, which would ensure (2.7) for all
values of $\theta$, is motivated in part by the hope of the existence, within the restricted
class, of uniformly best estimators for which $v(\theta)$ attains the lower bound. It is
further justified by the fact, brought out by Le Cam (1953), Huber (1966), and Hájek
(1972), that violation of (2.7) at a point $\theta_0$ entails certain unpleasant properties of
the risk of the estimator in the neighborhood of $\theta_0$.

This behavior can be illustrated in the Hodges example. 

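Example 2.7 Continuation of Example 2.5. The normalized risk function

(2.11) $R_n(\theta) = nE(\delta_n - \theta)^2$

of the Hodges estimator $\delta_n$ can be written as

$$R_n(\theta) = 1 - (1 - a^2) \int_{\underline{I}_n}^{\bar{I}_n} (x + \sqrt{n}\,\theta)^2 \phi(x)\,dx + 2\theta\sqrt{n}(1 - a) \int_{\underline{I}_n}^{\bar{I}_n} (x + \sqrt{n}\,\theta)\phi(x)\,dx$$

where $\bar{I}_n = n^{1/4} - \sqrt{n}\,\theta$ and $\underline{I}_n = -n^{1/4} - \sqrt{n}\,\theta$. When the integrals are broken up into their three and two terms, respectively, and the relations

$$\Phi'(x) = \phi(x) \quad\text{and}\quad \phi'(x) = -x\phi(x)$$

are used, $R_n(\theta)$ reduces to

$$R_n(\theta) = \left[1 - (1 - a^2) \int_{\underline{I}_n}^{\bar{I}_n} x^2\phi(x)\,dx\right] + n\theta^2(1 - a)^2[\Phi(\bar{I}_n) - \Phi(\underline{I}_n)] + 2\sqrt{n}\,\theta a(1 - a)[\phi(\bar{I}_n) - \phi(\underline{I}_n)].$$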


Consider now the sequence of parameter values $\theta_n = 1/n^{1/4}$, so that

$$\sqrt{n}\,\theta_n = n^{1/4}, \quad \underline{I}_n = -2n^{1/4}, \quad \bar{I}_n = 0.$$

Then,

$$\sqrt{n}\,\theta_n \phi(\underline{I}_n) \to 0,$$

so that the third term tends to infinity as $n \to \infty$. Since the second term is positive and the first term is bounded, it follows that

$$R_n(\theta_n) \to \infty \quad\text{for } \theta_n = 1/n^{1/4},$$

and hence, a fortiori, that

$$\sup_\theta R_n(\theta) \to \infty.$$

Let us now compare this result with the fact that (Problem 2.12) for any fixed $\theta$,

$$R_n(\theta) \to 1 \text{ for } \theta \ne 0, \quad R_n(0) \to a^2.$$

(This shows that in the present case, the limiting risk is equal to the asymptotic variance (see Problem 2.4).) The functions $R_n(\theta)$ are continuous functions of $\theta$ with discontinuous limit function

$$L(\theta) = 1 \text{ for } \theta \ne 0, \quad L(0) = a^2.$$


However, each of the functions with large values of $n$ rises to a high peak above the limit value 1, at values of $\theta$ tending to the origin with $n$, and with the value of the peak tending to infinity with $n$. This is illustrated in Figure 2.1, where values of $R_n(\theta)$ are given for various values of $n$.

Figure 2.1. Risk functions $R_n(\theta)$ of the superefficient estimator $\delta_n$ of Example 2.5 for $a = .5$.

As Figure 2.1 shows, the improvement (over $\bar X$) from 1 to $a^2$ in the limiting risk at the origin, and hence for large finite $n$ also near the origin, leads to an enormous increase in risk at points slightly further away but nevertheless close to the origin. (In this connection, see Problem 2.14.) ||
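
The behavior described in Example 2.7 can be checked numerically. The following is a minimal sketch, not part of the original text, that evaluates the reduced three-term expression for $R_n(\theta)$ with $a = .5$; the function names are our own, and the integral of $x^2\phi(x)$ is obtained from the identity $(x\phi(x))' = \phi(x) - x^2\phi(x)$.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def hodges_risk(n, theta, a=0.5):
    """Normalized risk R_n(theta) of the Hodges estimator (three-term form)."""
    root_n = math.sqrt(n)
    upper = n ** 0.25 - root_n * theta     # upper limit of integration
    lower = -n ** 0.25 - root_n * theta    # lower limit of integration
    mass = Phi(upper) - Phi(lower)
    # integral of x^2 phi(x) over [lower, upper]
    second_moment = mass - (upper * phi(upper) - lower * phi(lower))
    term1 = 1.0 - (1.0 - a * a) * second_moment
    term2 = n * theta ** 2 * (1.0 - a) ** 2 * mass
    term3 = 2.0 * root_n * theta * a * (1.0 - a) * (phi(upper) - phi(lower))
    return term1 + term2 + term3

# Columns: n, risk at 0, risk at the fixed value 1, risk at theta_n = n^(-1/4)
for n in (10**2, 10**4, 10**6):
    theta_n = n ** -0.25
    print(n, round(hodges_risk(n, 0.0), 4), round(hodges_risk(n, 1.0), 4),
          round(hodges_risk(n, theta_n), 2))
```

For fixed $\theta$ the computed values settle near $a^2$ (at the origin) or 1 (elsewhere), while along $\theta_n = n^{-1/4}$ they grow without bound, in agreement with the example.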


3 Efficient Likelihood Estimation 

Under smoothness assumptions similar to those of Theorem 2.6, we shall in the present section prove the existence of asymptotically efficient estimators and provide a method for determining such estimators which, in many cases, leads to an explicit solution.

We begin with the following assumptions: 

(A0) The distributions $P_\theta$ of the observations are distinct (otherwise, $\theta$ cannot be estimated consistently²).

(A1) The distributions $P_\theta$ have common support.

(A2) The observations are $X = (X_1, \ldots, X_n)$, where the $X_i$ are iid with probability density $f(x_i|\theta)$ with respect to $\mu$.

(A3) The parameter space $\Omega$ contains an open set $\omega$ of which the true parameter value $\theta_0$ is an interior point.

² But see Redner (1981) for a different point of view.


Note: The true value of $\theta$ will be denoted by $\theta_0$.

The joint density of the sample, $f(x_1|\theta) \cdots f(x_n|\theta) = \prod f(x_i|\theta)$, considered as a function of $\theta$, plays a central role in statistical estimation, with a history dating back to the eighteenth century (see Note 10.1).

Definition 3.1 For a sample point $x = (x_1, \ldots, x_n)$ from a density $f(x|\theta)$, the likelihood function $L(\theta|x) = f(x|\theta)$ is the sample density considered as a function of $\theta$ for fixed $x$.

In the case of iid observations, we have $L(\theta|x) = \prod_{i=1}^n f(x_i|\theta)$. It is then often easier to work with the logarithm of the likelihood function, the log likelihood $l(\theta|x) = \sum_{i=1}^n \log f(x_i|\theta)$.

Theorem 3.2 Under assumptions (A0)-(A2),

(3.1) $P_{\theta_0}(L(\theta_0|X) > L(\theta|X)) \to 1$ as $n \to \infty$

for any fixed $\theta \ne \theta_0$.

Proof. The inequality is equivalent to

$$\frac{1}{n} \sum \log[f(X_i|\theta)/f(X_i|\theta_0)] < 0.$$

By the law of large numbers, the left side tends in probability toward

$$E_{\theta_0} \log[f(X|\theta)/f(X|\theta_0)].$$

Since $-\log$ is strictly convex, Jensen's inequality shows that

$$E_{\theta_0} \log[f(X|\theta)/f(X|\theta_0)] < \log E_{\theta_0}[f(X|\theta)/f(X|\theta_0)] = 0,$$

and the result follows. □
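
As an informal illustration of Theorem 3.2 (a sketch under assumptions of our own choosing, not an argument from the text), one can estimate the probability in (3.1) by simulation. Here the family is taken to be $N(\theta, 1)$ with $\theta_0 = 0$ and the competing value $\theta = 0.5$; the function names, the number of replications, and the seed are arbitrary.

```python
import random

def loglik_normal(theta, xs):
    """Log likelihood of N(theta, 1), up to an additive constant."""
    return -0.5 * sum((x - theta) ** 2 for x in xs)

def prob_true_value_wins(n, theta0=0.0, theta=0.5, reps=2000, seed=0):
    """Monte Carlo estimate of P_{theta0}(L(theta0|X) > L(theta|X))."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        xs = [rng.gauss(theta0, 1.0) for _ in range(n)]
        if loglik_normal(theta0, xs) > loglik_normal(theta, xs):
            wins += 1
    return wins / reps

estimates = {n: prob_true_value_wins(n) for n in (1, 10, 100)}
print(estimates)
```

In this particular family the event is equivalent to the sample mean falling below 0.25, so the estimated probabilities should increase toward 1 as $n$ grows.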

By (3.1), the density of $X$ at the true $\theta_0$ exceeds that at any other fixed $\theta$ with high probability when $n$ is large. We do not know $\theta_0$, but we can determine the value $\hat\theta$ of $\theta$ which maximizes the density of $X$, that is, which maximizes the likelihood function at the observed $X = x$. If this value exists and is unique, it is the maximum likelihood estimator (MLE) of $\theta$.³ The MLE of $g(\theta)$ is defined to be $g(\hat\theta)$. If $g$ is 1:1 and $\xi = g(\theta)$, this agrees with the definition of $\hat\xi$ as the value of $\xi$ that maximizes the likelihood, and the definition is consistent also in the case that $g$ is not 1:1. (In this connection, see Zehna 1966 and Berk 1967b.)

Theorem 3.2 suggests that if the density of $X$ varies smoothly with $\theta$, the MLE of $\theta$ typically should be close to the true value of $\theta$, and hence be a reasonable estimator.

³ For a more general definition, see Strasser (1985, Sections 64.4 and 84.2) or Scholz (1980, 1985). A discussion of the MLE as a summarizer of the data rather than an estimator is given by Efron (1982a).

Example 3.3 Binomial MLE. Let $X$ have the binomial distribution $b(p, n)$. Then, the MLE of $p$ is obtained by maximizing $\binom{n}{x} p^x q^{n-x}$ and hence is $\hat p = X/n$ (Problem 3.1). ||

Example 3.4 Normal MLE. If $X_1, \ldots, X_n$ are iid as $N(\xi, \sigma^2)$, it is convenient to obtain the MLE by maximizing the logarithm of the density, $-n \log \sigma - \frac{1}{2\sigma^2} \sum (x_i - \xi)^2 - c$. When $(\xi, \sigma)$ are both unknown, the maximizing values are $\hat\xi = \bar x$, $\hat\sigma^2 = \sum (x_i - \bar x)^2/n$ (Problem 3.3). ||
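
The closed forms of Examples 3.3 and 3.4 can be compared with a direct numerical search. The sketch below is only illustrative: the data values, the grid, and the function names are our own assumptions.

```python
import math

def binomial_loglik(p, x, n):
    """Binomial log likelihood in p, dropping the constant log C(n, x)."""
    return x * math.log(p) + (n - x) * math.log(1.0 - p)

def normal_loglik(xi, sigma, xs):
    """Normal log likelihood -n log(sigma) - sum (x_i - xi)^2 / (2 sigma^2)."""
    return -len(xs) * math.log(sigma) - sum((v - xi) ** 2 for v in xs) / (2.0 * sigma ** 2)

# Binomial: compare p_hat = x/n with a grid search over (0, 1).
x, n = 7, 20
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: binomial_loglik(p, x, n))

# Normal: closed-form maximizers for illustrative data (not from the text).
xs = [2.1, 1.4, 3.3, 2.8, 1.9]
xi_hat = sum(xs) / len(xs)
sigma_hat = math.sqrt(sum((v - xi_hat) ** 2 for v in xs) / len(xs))

print(p_grid, x / n, xi_hat, sigma_hat ** 2)
```

Perturbing either `xi_hat` or `sigma_hat` lowers `normal_loglik`, which is a quick sanity check on the closed forms rather than a proof.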


As a first question regarding the MLE for iid variables, let us ask whether it is consistent. We begin with the case in which $\Omega$ is finite, so that $\theta$ can take on only a finite number of values. In this case, a sequence $\delta_n$ is consistent if and only if

(3.2) $P_\theta(\delta_n = \theta) \to 1$ for all $\theta \in \Omega$

(Problem 3.6).

Corollary 3.5 Under assumptions (A0)-(A2), if $\Omega$ is finite, the MLE $\hat\theta_n$ exists, it is unique with probability tending to 1, and it is consistent.

Proof. The result is an immediate consequence of Theorem 3.2 and the fact that if $P(A_{in}) \to 1$ for $i = 1, \ldots, k$, then also $P[A_{1n} \cap \cdots \cap A_{kn}] \to 1$ as $n \to \infty$. □

The proof of Corollary 3.5 breaks down when $\Omega$ is not restricted to be finite. That the consistency conclusion itself can break down even if $\Omega$ is only countably infinite is shown by the following example due to Bahadur (1958) and Le Cam (1979b, 1990).


Example 3.6 An inconsistent MLE. Let $h$ be a continuous function defined on $(0, 1]$, which is strictly decreasing, with $h(x) > 1$ for all $0 < x < 1$ and satisfying

(3.3) $\int_0^1 h(x)\,dx = \infty.$

Given a constant $0 < c < 1$, let $a_k$, $k = 0, 1, \ldots$, be a sequence of constants defined inductively as follows: $a_0 = 1$; given $a_0, \ldots, a_{k-1}$, the constant $a_k$ is defined by

(3.4) $\int_{a_k}^{a_{k-1}} [h(x) - c]\,dx = 1 - c.$

It is easy to see that there exists a unique value $0 < a_k < a_{k-1}$ satisfying (3.4) (Problem 3.8). Since the sequence $\{a_k\}$ is decreasing, it tends to a limit $a \ge 0$. If $a$ were $> 0$, the left side of (3.4) would tend to zero, which is impossible. Thus, $a_k \to 0$ as $k \to \infty$.
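
To make the inductive definition (3.4) concrete, the following sketch computes the first few $a_k$ for the illustrative choice $h(x) = 1/x$, which is our own assumption: it is strictly decreasing with infinite integral over $(0, 1]$, so (3.3) holds, but it does not grow fast enough for the later condition (3.6). For this $h$ the integral in (3.4) has the closed form $\log(a_{k-1}/a_k) - c(a_{k-1} - a_k)$, and the root is found by bisection.

```python
import math

def next_a(a_prev, c):
    """Solve  integral_a^{a_prev} [1/x - c] dx = 1 - c  for a in (0, a_prev)."""
    def excess(a):
        return math.log(a_prev / a) - c * (a_prev - a) - (1.0 - c)
    lo, hi = a_prev * 1e-12, a_prev       # excess(lo) > 0 > excess(hi)
    for _ in range(200):                  # bisection; excess is decreasing in a
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def a_sequence(c, k_max):
    """Return [a_0, a_1, ..., a_{k_max}] with a_0 = 1."""
    a = [1.0]
    for _ in range(k_max):
        a.append(next_a(a[-1], c))
    return a

a = a_sequence(0.5, 6)
print(a)
```

The computed sequence is strictly decreasing and shrinks toward zero, as the argument following (3.4) requires.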

Consider now the sequence of densities

$$f_k(x) = \begin{cases} c & \text{if } x \le a_k \text{ or } x > a_{k-1} \\ h(x) & \text{if } a_k < x \le a_{k-1}, \end{cases}$$

and the problem of estimating the parameter $k$ on the basis of independent observations $X_1, \ldots, X_n$ from $f_k$. We shall show that the MLE exists and that it tends to infinity in probability regardless of the true value $k_0$ of $k$ and is, therefore, not consistent, provided $h(x) \to \infty$ sufficiently fast as $x \to 0$.

Let us denote the joint density of the $X$'s by

$$p_k(x) = f_k(x_1) \cdots f_k(x_n).$$

That the MLE exists follows from the fact that $p_k(x) = c^n < 1$ for any value of $k$ for which the interval $I_k = (a_k, a_{k-1}]$ contains none of the observations, so that the maximizing value of $k$ must be one of the $\le n$ values for which $I_k$ contains at least one of the $x$'s.

For $n = 1$, the MLE is the value of $k$ for which $X_1 \in I_k$, and for $n = 2$, the MLE is the value of $k$ for which $X_{(1)} \in I_k$. For $n = 3$, it may happen that one observation lies in $I_k$ and two in $I_l$ ($l < k$), and whether the MLE is $k$ or $l$ then depends on whether $c \cdot h(x_{(1)})$ is greater than or less than $h(x_{(2)})h(x_{(3)})$.

We shall now prove that the MLE $\hat K_n$ (which is unique with probability tending to 1) tends to infinity in probability, that is, that

(3.5) $P(\hat K_n > k) \to 1$ for every $k$,

provided $h$ satisfies

(3.6) $h(x) > e^{1/x^2}$

for all sufficiently small values of $x$.

To prove (3.5), we will show that for any fixed $j$,

(3.7) $P[p_{K_n^*}(X) > p_j(X)] \to 1$ as $n \to \infty$

where $K_n^*$ is the value of $k$ for which $X_{(1)} \in I_k$. Since $p_{\hat K_n}(X) \ge p_{K_n^*}(X)$, it then follows that for any fixed $k$,

$$P[\hat K_n > k] \ge P[p_{K_n^*}(X) > p_j(X) \text{ for } j = 1, \ldots, k] \to 1.$$
To prove (3.7), consider

$$L_{jk} = \log \frac{f_k(x_1) \cdots f_k(x_n)}{f_j(x_1) \cdots f_j(x_n)} = \sum\nolimits^{(1)} \log \frac{h(x_i)}{c} - \sum\nolimits^{(2)} \log \frac{h(x_i)}{c}$$

where $\sum^{(1)}$ and $\sum^{(2)}$ extend over all $i$ for which $x_i \in I_k$ and $x_i \in I_j$, respectively. Now $x_i \in I_j$ implies that $h(x_i) < h(a_j)$, so that

$$\sum\nolimits^{(2)} \log[h(x_i)/c] < \nu_{jn} \log[h(a_j)/c]$$

where $\nu_{jn}$ is the number of $x$'s in $I_j$. Similarly, for $k = K_n^*$,

$$\sum\nolimits^{(1)} \log[h(x_i)/c] \ge \log[h(x_{(1)})/c]$$

since $\log[h(x)/c] \ge 0$ for all $x$. Thus,

$$\frac{1}{n} L_{j,K_n^*} \ge \frac{1}{n} \log \frac{h(x_{(1)})}{c} - \frac{1}{n} \nu_{jn} \log \frac{h(a_j)}{c}.$$

Since $\nu_{jn}/n$ tends in probability to $P(X_1 \in I_j) < 1$, it only remains to show that

(3.8) $\frac{1}{n} \log h(X_{(1)}) \to \infty$ in probability.

Instead of $X_1, \ldots, X_n$, consider a sample $Y_1, \ldots, Y_n$ from the uniform distribution $U(0, 1/c)$. Then, for any $x$, $P(Y_i > x) \ge P(X_i > x)$ and hence

$$P[h(Y_{(1)}) > x] \le P[h(X_{(1)}) > x],$$

and it is therefore enough to prove that $(1/n) \log h(Y_{(1)}) \to \infty$ in probability. If $h$ satisfies (3.6), $(1/n) \log h(Y_{(1)}) > 1/(nY_{(1)}^2)$ and the right side tends to infinity in probability since $nY_{(1)}$ tends to a limit distribution (Problem 2.6). This completes the proof. ||

For later reference, note that the proof has established not only (3.7) but the fact that for any fixed $A$ (Problem 3.9),

(3.9) $P[p_{K_n^*}(X) > A^n p_j(X)] \to 1.$

The example suggests (and this suggestion will be verified in the next section) that also for densities depending smoothly on a continuously varying parameter $\theta$, the MLE need not be consistent. We shall now show, however, that a slightly weaker conclusion is possible under relatively mild conditions. Throughout the present section, we shall assume $\theta$ to be real-valued. The case of several parameters will be taken up in Section 6.5.

In the following, we shall frequently use the shorthand notation $l(\theta)$ for the log likelihood

(3.10) $l(\theta|x) = \sum \log f(x_i|\theta),$

and $l'(\theta)$, $l''(\theta)$, ... for its derivatives with respect to $\theta$.

A way around the difficulty presented by this example was found by Cramér (1946a, 1946b), who replaced the search for a global maximum of the likelihood function with that for a local maximum.

Theorem 3.7 Let $X_1, \ldots, X_n$ satisfy (A0)-(A3) and suppose that for almost all $x$, $f(x|\theta)$ is differentiable with respect to $\theta$ in $\omega$, with derivative $f'(x|\theta)$. Then, with probability tending to 1 as $n \to \infty$, the likelihood equation

(3.11) $\frac{\partial}{\partial\theta} l(\theta|x) = 0$

or, equivalently, the equation

(3.12) $l'(\theta|x) = \sum \frac{f'(x_i|\theta)}{f(x_i|\theta)} = 0$

has a root $\hat\theta_n = \hat\theta_n(x_1, \ldots, x_n)$ such that $\hat\theta_n(X_1, \ldots, X_n)$ tends to the true value $\theta_0$ in probability.

Proof. Let $a$ be small enough so that $(\theta_0 - a, \theta_0 + a) \subset \omega$, and let

(3.13) $S_n = \{x : l(\theta_0|x) > l(\theta_0 - a|x) \text{ and } l(\theta_0|x) > l(\theta_0 + a|x)\}.$

By Theorem 3.2, $P_{\theta_0}(S_n) \to 1$. For any $x \in S_n$, there thus exists a value $\theta_0 - a < \hat\theta_n < \theta_0 + a$ at which $l(\theta)$ has a local maximum, so that $l'(\hat\theta_n) = 0$. Hence, for any $a > 0$ sufficiently small, there exists a sequence $\hat\theta_n = \hat\theta_n(a)$ of roots such that

(3.14) $P_{\theta_0}(|\hat\theta_n - \theta_0| < a) \to 1.$

It remains to show that we can determine such a sequence, which does not depend on $a$.

Let $\theta_n^*$ be the root closest to $\theta_0$. [This exists because the limit of a sequence of roots is again a root by the continuity of $l(\theta)$.] Then, clearly, $P_{\theta_0}(|\theta_n^* - \theta_0| < a) \to 1$ and this completes the proof. □

In connection with this theorem, the following comments should be noted.

1. The proof yields the additional fact that with probability tending to 1, the roots $\hat\theta_n(a)$ can be chosen to be local maxima and so, therefore, can the $\theta_n^*$ if we let $\theta_n^*$ be the closest root corresponding to a maximum.

2. On the other hand, the theorem does not establish the existence of a consistent estimator sequence since, with the true value $\theta_0$ unknown, the data do not tell us which root to choose so as to obtain a consistent sequence. An exception, of course, is the case in which the root is unique.

3. It should also be emphasized that the existence of a root $\hat\theta_n$ is not asserted for all $x$ (or for a given $n$ even for any $x$). This does not affect consistency, which only requires $\hat\theta_n$ to be defined on a set $S_n'$, the probability of which tends to 1 as $n \to \infty$.

4. Although the likelihood equation can have many roots, the consistent sequence of roots generated by Theorem 3.7 is essentially unique. For a more precise statement of this result, which is due to Huzurbazar (1948), see Problem 3.28.

5. Finally, there is a technical question concerning the measurability of the estimator sequence $\hat\theta_n(a)$, and hence of the sequence $\theta_n^*$. Recall from Section 1.2 that $\hat\theta_n(a)$ is a measurable function if the set $\{a : \hat\theta_n(a) > t\}$ is a measurable set for every $t$. Since $\hat\theta_n(a)$ is defined implicitly, its measurability (and also that of $\theta_n^*$) is not immediately obvious. Happily, it turns out that the sequences $\hat\theta_n(a)$ and $\theta_n^*$ are measurable. (For details, see Problem 3.29.)

Corollary 3.8 Under the assumptions of Theorem 3.7, if the likelihood equation has a unique root $\delta_n$ for each $n$ and all $x$, then $\{\delta_n\}$ is a consistent sequence of estimators of $\theta$. If, in addition, the parameter space is an open interval $(\underline\theta, \bar\theta)$ (not necessarily finite), then with probability tending to 1, $\delta_n$ maximizes the likelihood, that is, $\delta_n$ is the MLE, which is therefore consistent.

Proof. The first statement is obvious. To prove the second, suppose the probability of $\delta_n$ being the MLE does not tend to 1. Then, for sufficiently large values of $n$, the likelihood must tend to a supremum as $\theta$ tends toward $\underline\theta$ or $\bar\theta$ with positive probability. Now with probability tending to 1, $\delta_n$ is a local maximum of the likelihood, which must then also possess a local minimum between $\delta_n$ and that end point, that is, a second root. This contradicts the assumed uniqueness of the root. □

The conclusion of Corollary 3.8 holds, of course, not only when the root of the likelihood equation is unique but also when the probability of multiple roots tends to zero as $n \to \infty$. On the other hand, even when the root is unique, the corollary says nothing about its properties for finite $n$.

Example 3.9 Minimum likelihood. Let $X$ take on the values 0, 1, 2 with probabilities $6\theta^2 - 4\theta + 1$, $\theta - 2\theta^2$, and $3\theta - 4\theta^2$ ($0 < \theta < 1/2$). Then, the likelihood equation has a unique root for all $x$, which is a minimum for $x = 0$ and a maximum for $x = 1$ and 2 (Problem 3.11). ||
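
The claim of Example 3.9 is easy to verify directly. In the sketch below (our own illustration), the roots of the likelihood equation are obtained by hand from $p_x'(\theta) = 0$ for a single observation, and each root is then classified by comparing the likelihood there with its values on a grid over $(0, 1/2)$.

```python
def probs(theta):
    """P(X = 0), P(X = 1), P(X = 2) for 0 < theta < 1/2."""
    return (6 * theta**2 - 4 * theta + 1,
            theta - 2 * theta**2,
            3 * theta - 4 * theta**2)

# Roots of p_x'(theta) = 0:  12*theta - 4 = 0,  1 - 4*theta = 0,  3 - 8*theta = 0.
ROOTS = (1 / 3, 1 / 4, 3 / 8)
GRID = [k / 1000 for k in range(1, 500)]

def classify(x, tol=1e-12):
    """Say whether the root for observation x minimizes or maximizes the likelihood."""
    at_root = probs(ROOTS[x])[x]
    values = [probs(t)[x] for t in GRID]
    if at_root <= min(values) + tol:
        return "minimum"
    if at_root >= max(values) - tol:
        return "maximum"
    return "neither"

for x in (0, 1, 2):
    print(x, ROOTS[x], classify(x))
```

All three roots lie in $(0, 1/2)$; only for $x = 0$ is the unique root a minimum of the likelihood.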

Theorem 3.7 establishes the existence of a consistent root of the likelihood equation. The next theorem asserts that any such sequence is asymptotically normal and efficient.

Theorem 3.10 Suppose that $X_1, \ldots, X_n$ are iid and satisfy the assumptions of Theorem 2.6, with (c) and (d) replaced by the corresponding assumptions on the third (rather than the second) derivative, that is, by the existence of a third derivative satisfying

(3.15) $\left|\frac{\partial^3}{\partial\theta^3} \log f(x|\theta)\right| \le M(x)$ for all $x \in A$, $\theta_0 - c < \theta < \theta_0 + c$

with

(3.16) $E_{\theta_0}[M(X)] < \infty.$

Then, any consistent sequence $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ of roots of the likelihood equation satisfies

(3.17) $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{\mathcal L} N\left(0, \frac{1}{I(\theta)}\right).$

We shall call such a sequence $\hat\theta_n$ an efficient likelihood estimator (ELE) of $\theta$. It is typically (but need not be, see Example 4.1) provided by the MLE. Note also that any sequence $\theta_n^*$ satisfying (3.17) is asymptotically efficient in the sense of Definition 2.4.

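Proof of Theorem 3.10. For any fixed $x$, expand $l'(\hat\theta_n)$ about $\theta_0$,

$$l'(\hat\theta_n) = l'(\theta_0) + (\hat\theta_n - \theta_0) l''(\theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)^2 l'''(\theta_n^*)$$

where $\theta_n^*$ lies between $\theta_0$ and $\hat\theta_n$. By assumption, the left side is zero, so that

$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{(1/\sqrt{n})\, l'(\theta_0)}{-(1/n)\, l''(\theta_0) - (1/2n)(\hat\theta_n - \theta_0) l'''(\theta_n^*)}$$

where it should be remembered that $l(\theta)$, $l'(\theta)$, and so on are functions of $X$ as well as $\theta$. We shall show that

(3.18) $\frac{1}{\sqrt{n}} l'(\theta_0) \xrightarrow{\mathcal L} N[0, I(\theta_0)],$

that

(3.19) $-\frac{1}{n} l''(\theta_0) \xrightarrow{P} I(\theta_0)$

and that

(3.20) $\frac{1}{n} l'''(\theta_n^*)$ is bounded in probability.

The desired result then follows from Theorem 1.8.10.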

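Of the above statements, (3.18) follows from the fact that

$$\frac{1}{\sqrt n} l'(\theta_0) = \sqrt{n}\, \frac{1}{n} \sum \left[\frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)} - E_{\theta_0} \frac{f'(X_i|\theta_0)}{f(X_i|\theta_0)}\right]$$

since the expectation term is zero, and then from the central limit theorem (CLT) and the definition of $I(\theta)$.

Next, (3.19) follows because

$$-\frac{1}{n} l''(\theta_0) = \frac{1}{n} \sum \frac{f'^2(X_i|\theta_0) - f(X_i|\theta_0) f''(X_i|\theta_0)}{f^2(X_i|\theta_0)},$$

and, by the law of large numbers, this tends in probability to

$$I(\theta_0) - E_{\theta_0} \frac{f''(X_i|\theta_0)}{f(X_i|\theta_0)} = I(\theta_0).$$

Finally, (3.20) is established by noting

$$\frac{1}{n} l'''(\theta) = \frac{1}{n} \sum \frac{\partial^3}{\partial\theta^3} \log f(X_i|\theta)$$

so that by (3.15),

$$\left|\frac{1}{n} l'''(\theta_n^*)\right| \le \frac{1}{n}[M(X_1) + \cdots + M(X_n)]$$

with probability tending to 1. The right side tends in probability to $E_{\theta_0}[M(X)]$, and this completes the proof. □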

Although the conclusions of Theorem 3.10 are quite far-reaching, the proof is remarkably easy. The reason is that Theorem 3.7 already puts $\hat\theta_n$ into the neighborhood of the true value $\theta_0$, so that an expansion about $\theta_0$ essentially linearizes the problem and thereby prepares the way for application of the central limit theorem.

Corollary 3.11 Under the assumptions of Theorem 3.10, if the likelihood equation has a unique root for all $n$ and $x$, and more generally if the probability of multiple roots tends to zero as $n \to \infty$, the MLE is asymptotically efficient.

To establish the assumptions of Theorem 3.10, one must verify the following two conditions that may not be obvious.

(a) That $\int f(x|\theta)\,d\mu(x)$ can be differentiated twice with respect to $\theta$ by differentiating under the integral sign.

(b) The third derivative is uniformly bounded by an integrable function [see (3.15)].

Conditions when (a) holds are given in books on calculus (see also Casella and Berger 1990, Section 2.4) although it is often easier simply to calculate the difference quotient and pass to the limit.

Condition (b) is usually easy to check after realizing that it is not necessary for (3.15) to hold for all $\theta$, but that it is enough if there exist $\theta_1 < \theta_0 < \theta_2$ such that (3.15) holds for all $\theta_1 \le \theta \le \theta_2$.

Example 3.12 One-parameter exponential family. Let $X_1, \ldots, X_n$ be iid according to a one-parameter exponential family with density

(3.21) $f(x_i|\eta) = e^{\eta T(x_i) - A(\eta)}$

with respect to a $\sigma$-finite measure $\mu$, and let the estimand be $\eta$. The likelihood equation is

(3.22) $\frac{1}{n} \sum T(x_i) = A'(\eta),$

which, by (1.5.14), is equivalent to

(3.23) $E_\eta[T(X_j)] = \frac{1}{n} \sum T(x_i).$

The left side of (3.23) is a strictly increasing function of $\eta$ since, by (1.5.15),

$$\frac{d}{d\eta}[E_\eta T(X_j)] = \mathrm{var}_\eta T(X_j) > 0.$$

It follows that Equation (3.23) has at most one solution. The conditions of Theorem 3.10 are easily checked in the present case. In particular, condition (a) follows from Theorem 1.5.8 and (b) from the fact that the third derivative of $\log f(x|\eta)$ is independent of $x$ and a continuous function of $\eta$. With probability tending to 1, (3.23) therefore has a solution $\hat\eta$. This solution is unique, consistent, and asymptotically efficient, so that

(3.24) $\sqrt{n}(\hat\eta - \eta) \xrightarrow{\mathcal L} N(0, 1/\mathrm{var}\, T)$

where $T = T(X_j)$ and the asymptotic variance follows from (2.5.18), which gives $I(\eta) = \mathrm{var}\, T$. ||
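
As a rough numerical check of the limiting variance in Example 3.12, which by Theorem 3.10 is $1/I(\eta) = 1/\mathrm{var}\, T$, the sketch below uses the Bernoulli family in its natural parametrization: $T(x) = x$, $A(\eta) = \log(1 + e^\eta)$, $\eta = \log[p/(1 - p)]$. The choice of family, the sample size, the number of replications, and the seed are our own assumptions.

```python
import math
import random

def eta_hat(xs):
    """Solve A'(eta) = mean of T for the Bernoulli family, where
    A'(eta) = e^eta / (1 + e^eta); the solution is the logit of the sample mean."""
    xbar = sum(xs) / len(xs)
    return math.log(xbar / (1.0 - xbar))

def simulated_variance(p=0.3, n=400, reps=2000, seed=1):
    """Sample variance of sqrt(n) * (eta_hat - eta) over repeated samples."""
    rng = random.Random(seed)
    eta = math.log(p / (1.0 - p))
    draws = []
    for _ in range(reps):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        draws.append(math.sqrt(n) * (eta_hat(xs) - eta))
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)

p = 0.3
v = simulated_variance(p)
print(v, 1.0 / (p * (1.0 - p)))   # simulated variance and 1 / var(T)
```

The two printed numbers should be close; the agreement is only approximate because both $n$ and the number of replications are finite.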


Example 3.13 Truncated normal. As an illustration of the preceding example, consider a sample of $n$ observations from a normal distribution $N(\xi, 1)$, truncated at two fixed points $a < b$. The density of a single $X$ is then

$$\frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \xi)^2\right] \Big/ [\Phi(b - \xi) - \Phi(a - \xi)], \quad a < x < b,$$

which satisfies (3.21) with $\eta = \xi$, $T(x) = x$. An ELE will therefore be the unique solution of $E_\xi(X) = \bar x$ if it exists. To see that this equation has a solution for any value $a < \bar x < b$, note that as $\xi \to -\infty$ or $+\infty$, $X$ tends in probability to $a$ or $b$, respectively (Problem 3.12). Since $X$ is bounded, this implies that also $E_\xi(X)$ tends to $a$ or $b$. Since $E_\xi(X)$ is continuous, the existence of $\hat\xi$ follows. ||
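
The equation $E_\xi(X) = \bar x$ of Example 3.13 has no closed-form solution, but since the left side is increasing in $\xi$ it can be solved by bisection. The sketch below is one possible implementation under our own assumptions: it uses the standard expression $E_\xi(X) = \xi + [\phi(a - \xi) - \phi(b - \xi)]/[\Phi(b - \xi) - \Phi(a - \xi)]$ for the mean of a truncated normal and searches only within a fixed bracket, so values of $\bar x$ extremely close to $a$ or $b$ are rejected.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def upper_tail(x):
    """P(Z > x) for standard normal Z, accurate in the far tail."""
    return 0.5 * math.erfc(x / SQRT2)

def interval_prob(lo, hi):
    """P(lo < Z < hi), computed from the smaller tails to limit cancellation."""
    if lo > 0.0:
        return upper_tail(lo) - upper_tail(hi)
    return upper_tail(-hi) - upper_tail(-lo)

def truncated_mean(xi, a, b):
    """E_xi(X) for N(xi, 1) truncated to (a, b)."""
    lo, hi = a - xi, b - xi
    return xi + (phi(lo) - phi(hi)) / interval_prob(lo, hi)

def solve_xi(xbar, a, b, width=30.0):
    """Bisection for E_xi(X) = xbar, using that E_xi(X) is increasing in xi."""
    lo, hi = a - width, b + width
    if not truncated_mean(lo, a, b) < xbar < truncated_mean(hi, a, b):
        raise ValueError("xbar too close to an end point for this bracket")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, a, b) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(solve_xi(0.5, 0.0, 1.0), solve_xi(0.6, 0.0, 1.0))
```

By symmetry, a sample mean at the midpoint of $(a, b)$ gives $\hat\xi$ equal to that midpoint; a sample mean only slightly off center already moves $\hat\xi$ a good deal further, because the truncated variable has small variance.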


For densities that are members of location or scale families, it is fairly straightforward to determine the existence and behavior of the MLE. (See Problems 3.15-3.19.)

We turn to one last example, which is not covered by Theorem 3.10. 

Example 3.14 Double exponential. For the double exponential density $DE(\theta, 1)$ given in Table 1.5.1, it is not true that for all (or almost all) $x$, $f(x - \theta)$ is differentiable with respect to $\theta$, since for every $x$ there exists a value ($\theta = x$) at which the derivative does not exist. Despite this failure, the MLE (which is the median of the $X$'s) satisfies the conclusion of Theorem 3.10 and is asymptotically normal with variance $1/n$ (see Problem 3.25). This was established by Daniels (1961), who proved a general theorem, not requiring differentiability of the density, that was motivated by this problem. (See Note 10.2.) ||


4 Likelihood Estimation: Multiple Roots 

When the likelihood equation has multiple roots, the assumptions of Theorem 3.10 are no longer sufficient to guarantee consistency of the MLE, even when it exists for all $n$. This is shown by the following example due to Le Cam (1979b, 1990), which is obtained by embedding the sequence $\{f_k\}$ of Example 3.6 in a sufficiently smooth continuous-parameter family.

Example 4.1 Continuation of Example 3.6. For $k \le \theta < k + 1$, $k = 1, 2, \ldots$, let

(4.1) $f(x|\theta) = [1 - u(\theta - k)] f_k(x) + u(\theta - k) f_{k+1}(x),$

with $f_k$ defined as in Example 3.6 and $u$ defined on $(-\infty, \infty)$ such that $u(x) = 0$ for $x \le 0$ and $u(x) = 1$ for $x \ge 1$, is strictly increasing on $(0, 1)$, and is infinitely differentiable on $(-\infty, \infty)$ (Problem 4.1). Let $X_1, \ldots, X_n$ be iid, each with density $f(x|\theta)$, and let $p(x|\theta) = \prod f(x_i|\theta)$.

Since for any given $x$, the density $p(x|\theta)$ is bounded and continuous in $\theta$ and is equal to $c^n$ for all sufficiently large $\theta$ and greater than $c^n$ for some $\theta$, it takes on its maximum for some finite $\theta$, and the MLE $\hat\theta_n$ therefore exists.

To see that $\hat\theta_n \to \infty$ in probability, note that for $k \le \theta < k + 1$,

(4.2) $p(x|\theta) \le \prod \max[f_k(x_i), f_{k+1}(x_i)] = \bar p_k(x).$

If $\hat K_n$ and $K_n^*$ are defined as in Example 3.6, the argument of that example shows that it is enough to prove that for any fixed $j$,

(4.3) $P[p_{K_n^*}(X) > \bar p_j(X)] \to 1$ as $n \to \infty,$

where $p_k(x) = p(x|k)$. Now

$$L_{jk} = \log \frac{p_k(x)}{\bar p_j(x)} = \sum\nolimits^{(1)} \log \frac{h(x_i)}{c} - \sum\nolimits^{(2)} \log \frac{h(x_i)}{c} - \sum\nolimits^{(3)} \log \frac{h(x_i)}{c}$$

where $\sum^{(1)}$, $\sum^{(2)}$, and $\sum^{(3)}$ extend over all $i$ for which $x_i \in I_k$, $x_i \in I_j$, and $x_i \in I_{j+1}$, respectively. The argument is now completed as before to show that $\hat\theta_n \to \infty$ in probability regardless of the true value of $\theta$ and is therefore not consistent.

The example is not yet completely satisfactory since $\partial f(x|\theta)/\partial\theta = 0$ and, hence, $I(\theta) = 0$ for $\theta = 1, 2, \ldots$ [The remaining conditions of Theorem 3.10 are easily checked (Problem 4.2).] To remove this difficulty, define

(4.4) $g(x|\theta) = \frac{1}{2}[f(x|\theta) + f(x|\theta + \alpha e^{-\theta^2})], \quad \theta \ge 1,$

for some fixed $\alpha < 1$.

If $X_1, \ldots, X_n$ are iid according to $g(x|\theta)$, we shall now show that the MLE $\hat\theta_n$ continues to tend to infinity for any fixed $\theta$. We have, as before,

$$P[\hat\theta_n > k] \ge P\left[\prod g(x_i|K_n^*) > \prod g(x_i|\theta) \text{ for all } \theta \le k\right]$$

$$\ge P\left[\prod \frac{1}{2} f(x_i|K_n^*) > \prod \frac{1}{2}[f(x_i|\theta) + f(x_i|\theta + \alpha e^{-\theta^2})] \text{ for all } \theta \le k\right].$$

For $j \le \theta < j + 1$, it is seen from (4.1) that $[f(x_i|\theta) + f(x_i|\theta + \alpha e^{-\theta^2})]/2$ is a weighted average of $f_j(x_i)$, $f_{j+1}(x_i)$, and possibly $f_{j+2}(x_i)$. By using $\bar{\bar p}_j(x) = \prod \max[f_j(x_i), f_{j+1}(x_i), f_{j+2}(x_i)]$ in place of $\bar p_j(x)$, the proof can now be completed as before. Since the densities $g(x_i|\theta)$ satisfy the conditions of Theorem 3.10 (Problem 4.3), these conditions are therefore not enough to ensure the consistency of the MLE. (For another example, see Ferguson 1982.) ||


Even under the assumptions of Theorem 3.10, one is thus, in the case of multiple 
roots, still faced with the problem of identifying a consistent sequence of roots. 
Following are three possible approaches. 

(a) In many cases, the maximum likelihood estimator is consistent. Conditions 
which ensure this were given by, among others, Wald (1949), Wolfowitz 
(1965), Le Cam (1953, 1955, 1970), Kiefer and Wolfowitz (1956), Kraft 
and Le Cam (1956), Bahadur (1967), and Perlman (1972). A survey of the 
literature can be found in Perlman (1983). This material is technically difficult, 
and even when the conditions are satisfied, the determination of the MLE may 
present problems (see Barnett 1966). We shall therefore turn to somewhat 
simpler alternatives. 

The following two methods require that some sequence of consistent (but not 
necessarily efficient) estimators be available. In any given situation, it is usually 
easy to construct a consistent sequence, as will be illustrated below and in the next 
section. 

(b) Suppose that $\delta_n$ is any consistent estimator of $\theta$ and that the assumptions of Theorem 3.10 hold. Then, the root $\hat\theta_n$ of the likelihood equation closest to $\delta_n$ (which exists by the proof of Theorem 3.7) is also consistent, and hence is efficient by Theorem 3.10.

To see this, note that by Theorem 3.10, there exists a consistent sequence of roots, say $\theta_n^*$. Since $\theta_n^* - \delta_n \to 0$ in probability, so does $\hat\theta_n - \delta_n$.

The following approach, which does not require the determination of the closest 
root and in which the estimators are no longer exact roots of the likelihood equation, 
is often more convenient. 


(c) The usual iterative methods for solving the likelihood equation

(4.5) $l'(\theta) = 0$

are based on replacing the left side by the linear terms of its Taylor expansion about an approximate solution $\tilde\theta$. If $\hat\theta$ denotes a root of (4.5), this leads to the approximation

(4.6) $0 = l'(\hat\theta) \doteq l'(\tilde\theta) + (\hat\theta - \tilde\theta) l''(\tilde\theta),$

and hence to

(4.7) $\hat\theta = \tilde\theta - \frac{l'(\tilde\theta)}{l''(\tilde\theta)}.$

The procedure is then iterated by replacing $\tilde\theta$ by the value $\hat\theta$ of the right side of (4.7), and so on. This is the Newton-Raphson iterative process. (For a discussion of the performance of this procedure, see, for example, Barnett 1966, Stuart and Ord 1991, Section 18.21, or Searle et al. 1992, Section 8.2.)
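
The iteration (4.7) can be sketched in a few lines. As an illustration we take, as our own assumption, the logistic location family with density $e^{-(x-\theta)}/[1 + e^{-(x-\theta)}]^2$, for which $l'(\theta) = \sum \tanh[(x_i - \theta)/2]$ and $l''(\theta) = -\frac{1}{2}\sum \mathrm{sech}^2[(x_i - \theta)/2] < 0$; the data values and the stopping rule are also arbitrary.

```python
import math

def score(theta, xs):
    """l'(theta) for the logistic location family."""
    return sum(math.tanh(0.5 * (x - theta)) for x in xs)

def score_derivative(theta, xs):
    """l''(theta) = -(1/2) * sum of sech^2((x_i - theta)/2)."""
    return -0.5 * sum(1.0 - math.tanh(0.5 * (x - theta)) ** 2 for x in xs)

def newton_raphson(xs, start, tol=1e-12, max_iter=50):
    """Iterate (4.7): theta <- theta - l'(theta) / l''(theta)."""
    theta = start
    for _ in range(max_iter):
        step = score(theta, xs) / score_derivative(theta, xs)
        theta -= step
        if abs(step) < tol:
            break
    return theta

xs = [-1.2, 0.3, 0.8, 2.5, 1.1]     # illustrative data, not from the text
start = sum(xs) / len(xs)           # sample mean as the approximate solution
root = newton_raphson(xs, start)
print(start, root, score(root, xs))
```

Because the logistic log likelihood is concave, the iteration started at the sample mean settles on the unique root here; for likelihoods with several roots, the limit depends on the starting value.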


Here, we are concerned only with the first step and with the performance of the one-step approximation (4.7) as an estimator of $\theta$. The following result gives conditions on $\tilde\theta$ under which the resulting sequence of estimators is consistent, asymptotically normal, and efficient. It relies on the sequence of estimators possessing the following property.

Definition 4.2 A sequence of estimators $\delta_n$ is $\sqrt n$-consistent for $\theta$ if $\sqrt n(\delta_n - \theta)$ is bounded in probability, that is, if $\delta_n - \theta = O_p(1/\sqrt n)$.

Theorem 4.3 Suppose that the assumptions of Theorem 3.10 hold and that $\tilde\theta_n$ is not only a consistent but a $\sqrt n$-consistent⁴ estimator of $\theta$. Then, the estimator sequence

(4.8) $\delta_n = \tilde\theta_n - \frac{l'(\tilde\theta_n)}{l''(\tilde\theta_n)}$

is asymptotically efficient, that is, it satisfies (3.17) with $\delta_n$ in place of $\hat\theta_n$.


Proof. As in the proof of Theorem 3.10, expand $l'(\tilde\theta_n)$ about $\theta_0$ as

$$l'(\tilde\theta_n) = l'(\theta_0) + (\tilde\theta_n - \theta_0) l''(\theta_0) + \frac{1}{2}(\tilde\theta_n - \theta_0)^2 l'''(\theta_n^*)$$

where $\theta_n^*$ lies between $\theta_0$ and $\tilde\theta_n$. Substituting this expression into (4.8) and simplifying, we find

(4.9) $\sqrt n(\delta_n - \theta_0) = \frac{(1/\sqrt n)\, l'(\theta_0)}{-(1/n)\, l''(\tilde\theta_n)} + \sqrt n(\tilde\theta_n - \theta_0) \left[1 - \frac{l''(\theta_0)}{l''(\tilde\theta_n)} - \frac{1}{2}(\tilde\theta_n - \theta_0)\frac{l'''(\theta_n^*)}{l''(\tilde\theta_n)}\right].$

The result now follows from the following facts:

(a) $\frac{(1/\sqrt n)\, l'(\theta_0)}{-(1/n)\, l''(\theta_0)} \xrightarrow{\mathcal L} N[0, I^{-1}(\theta_0)]$ [(3.18) and (3.19)]

(b) $\sqrt n(\tilde\theta_n - \theta_0) = O_p(1)$ [assumption]

(c) $\frac{l'''(\theta_n^*)}{l''(\theta_0)} = O_p(1)$ [(3.19) and (3.20)]

(d) $\frac{l''(\tilde\theta_n)}{l''(\theta_0)} \to 1$ in probability [see below]

Here, (d) follows from the fact that

(4.10) $\frac{1}{n} l''(\tilde\theta_n) = \frac{1}{n} l''(\theta_0) + \frac{1}{n}(\tilde\theta_n - \theta_0) l'''(\theta_n^{**}),$

for some $\theta_n^{**}$ between $\theta_0$ and $\tilde\theta_n$. Now (3.19), (3.20), and consistency of $\tilde\theta_n$ applied to (4.10) imply (d). In turn, (b)-(d) show that the entire second term in (4.9) converges to zero in probability, and (a) shows that the first term has the correct limit distribution. □

⁴ A general method for constructing $\sqrt n$-consistent estimators is given by Le Cam (1969, p. 103). See also Bickel et al. (1993).

Corollary 4.4 Suppose that the assumptions of Theorem 4.3 hold and that the Fisher information $I(\theta)$ is a continuous function of $\theta$. Then, the estimator

(4.11) $\tilde\theta_n + \frac{l'(\tilde\theta_n)}{nI(\tilde\theta_n)}$

is also asymptotically efficient.

Proof. By (d) in the proof of Theorem 4.3, condition (h) of Theorem 2.6, and the law of large numbers, $-(1/n) l''(\tilde\theta_n) \to I(\theta_0)$ in probability. Also, since $I(\theta)$ is continuous, $I(\tilde\theta_n) \to I(\theta_0)$ in probability, so that $-(1/n) l''(\tilde\theta_n)/I(\tilde\theta_n) \to 1$ in probability, and this completes the proof. □

The estimators (4.8) and (4.11) are compared by Stuart (1958), who gives a heuristic argument why (4.11) might be expected to be closer to the ELE than (4.8) and provides a numerical example supporting this argument. See also Efron and Hinkley (1978) and Lindsay and Yi (1996).

Example 4.5 Location parameter. Consider the case of a symmetric location family, with density $f(x - \theta)$, in which the likelihood equation

(4.12) $\sum \frac{f'(x_i - \theta)}{f(x_i - \theta)} = 0$

has multiple roots. [For the Cauchy distribution, for example, it has been shown by Reeds (1985) that if (4.12) has $K + 1$ roots, then as $n \to \infty$, $K$ tends in law to a Poisson distribution with expectation $1/\pi$. The Cauchy case has also been considered by Barnett (1966) and Bai and Fu (1987).] If $\mathrm{var}(X) < \infty$, it follows from the CLT that the sample mean $\bar X_n$ is $\sqrt n$-consistent and that an asymptotically efficient estimator of $\theta$ is therefore provided by (4.8) or (4.11) with $\tilde\theta_n = \bar X$ as long as $f(x - \theta)$ satisfies the conditions of Theorem 3.10. For distributions such as the Cauchy for which $E(X^2) = \infty$, one can, instead, take for $\tilde\theta_n$ the sample median provided $f(0) > 0$; other robust estimators provide still further possibilities (see, for example, Huber 1973, 1981 or Haberman 1989). ||
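
For the Cauchy case of Example 4.5, the one-step estimators (4.8) and (4.11) started at the sample median can be sketched as follows. This is an illustration under our own assumptions: the data are simulated from a standard Cauchy shifted to $\theta_0 = 1$, the seed and sample size are arbitrary, and the Fisher information of the standard Cauchy location family is taken to be $I(\theta) = 1/2$.

```python
import math
import random

def score(theta, xs):
    """l'(theta) for the Cauchy location family."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def score_derivative(theta, xs):
    """l''(theta) for the Cauchy location family."""
    return sum(2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2 for x in xs)

def sample_median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])

def one_step_newton(xs, start):
    """Estimator (4.8): one Newton-Raphson step from the starting value."""
    return start - score(start, xs) / score_derivative(start, xs)

def one_step_scoring(xs, start, info=0.5):
    """Estimator (4.11): l'' replaced by -n * I(theta), with I = 1/2 here."""
    return start + score(start, xs) / (len(xs) * info)

rng = random.Random(0)
theta0 = 1.0   # true value, used only to generate illustrative data
xs = [theta0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2001)]
start = sample_median(xs)
newton = one_step_newton(xs, start)
scoring = one_step_scoring(xs, start)
print(start, newton, scoring)
```

Neither estimator requires locating or choosing among the roots of (4.12), which is the practical appeal of the one-step approach when there are many of them.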


Example 4.6 Grouped or censored observations. Suppose that $X_1, \ldots, X_n$ are iid according to a location family with cdf $F(x - \theta)$, with $F$ known and with $0 < F(x) < 1$ for all $x$, but that it is only observed whether each $X_i$ falls below $a$, between $a$ and $b$, or above $b$ where $a < b$ are two given constants. The $n$ observations constitute $n$ trinomial trials with probabilities $p_1 = p_1(\theta) = F(a - \theta)$, $p_2(\theta) = F(b - \theta) - F(a - \theta)$, $p_3(\theta) = 1 - F(b - \theta)$ for the three outcomes. If $V$ denotes the number of observations less than $a$, then

(4.13) $\sqrt n\left[\frac{V}{n} - p_1\right] \xrightarrow{\mathcal L} N[0, p_1(1 - p_1)]$

and, by Theorem 1.8.12,

(4.14) $\tilde\theta_n = a - F^{-1}\left(\frac{V}{n}\right)$

is a $\sqrt n$-consistent estimator of $\theta$. Since the estimator is not defined when $V = 0$ or $V = n$, some special definition has to be adopted in these cases whose probability however tends to zero as $n \to \infty$.

If the trinomial distribution for a single trial satisfies the assumptions of Theorem 3.10 as will be the case under mild assumptions on $F$, the estimator (4.8) is asymptotically efficient (but see the comment following Example 7.15). The approach applies, of course, equally to the case of more than three groups.

A very similar situation arises when the $X$'s are censored, say at a fixed point $a$. For example, they might be lengths of life of light bulbs or patients, with observation discontinued at time $a$. The observations can then be represented as

(4.15) $Y_i = \begin{cases} X_i & \text{if } X_i < a \\ a & \text{if } X_i \ge a. \end{cases}$

Here, the value $a$ of $Y_i$ when $X_i \ge a$ has no significance; it simply indicates that the value of $X_i$ is $\ge a$. The $Y$'s are then iid with density

(4.16) $g(y|\theta) = \begin{cases} f(y - \theta) & \text{if } y < a \\ 1 - F(a - \theta) & \text{if } y = a \end{cases}$

with respect to the measure $\mu$ which is Lebesgue measure on $(-\infty, a)$ and assigns measure 1 to the point $y = a$.

The estimator (4.14) continues to be $\sqrt n$-consistent in the present situation. An alternative starting point is, for example, the best linear combination of the ordered $X$'s less than $a$ (see, for example, Chan 1967). ||


Example 4.7 Mixtures. Let $X_1, \ldots, X_n$ be a sample from a distribution
$\theta G + (1 - \theta)H$, $0 < \theta < 1$, where $G$ and $H$ are two specified distributions with
densities $g$ and $h$. The log likelihood of a single observation is a concave function
of $\theta$, and so therefore is the log likelihood of a sample (Problem 4.5). It follows
that the likelihood equation has at most one solution. [The asymptotic performance
of the ML estimator is studied by Hill (1963).]

Even when the root is unique, as it is here, Theorem 4.3 provides an alternative,
which may be more convenient than the MLE. In the mixture problem, as in many
other cases, a $\sqrt{n}$-consistent estimator can be obtained by the method of moments,
which consists in equating the first $k$ moments of $X$ to the corresponding sample
moments, say

(4.17)    $E_\theta(X_i^r) = \frac{1}{n}\sum_{j=1}^{n} X_j^r, \quad r = 1, \ldots, k,$

where $k$ is the number of unknown parameters. (For further discussion, see, for
example, Cramer 1946a, Section 33.1, and Serfling 1980, Section 4.3.1.) In the
present case, suppose that $E(X_i) = \xi$ or $\eta$ when $X$ is distributed as $G$ or $H$, where
$\eta \ne \xi$ and $G$ and $H$ have finite variance. Since $k = 1$, the method of moments
estimates $\theta$ as the solution of the equation

$\xi\theta + \eta(1 - \theta) = \bar X_n$

and hence by

$\tilde\theta_n = \frac{\bar X_n - \eta}{\xi - \eta}.$

[If $\eta = \xi$ but the second moments of $X_i$ under $H$ and $G$ differ, one can, instead,
equate $E(X_i^2)$ with $\sum X_j^2/n$ (Problem 4.6).] An asymptotically efficient estimator
is then provided by (4.8).

Estimation under a mixture distribution provides interesting challenges and has
many applications in practice. There is a large literature on mixtures, and entry can
be found through the books by Everitt and Hand (1981), Titterington et al. (1985), 
McLachlan and Basford (1988), and McLachlan (1997). || 
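
The two-stage recipe of Example 4.7 (a moment estimator followed by one Newton step on the log likelihood, in the spirit of Theorem 4.3) can be sketched numerically. The component distributions $G = N(2, 1)$ and $H = N(0, 1)$, the function names, and the clipping of the starting value to $[0.01, 0.99]$ are illustrative assumptions, not taken from the text.

```python
import random
from statistics import NormalDist

# Hypothetical known components: G = N(2, 1) with mean xi, H = N(0, 1) with mean eta.
G, H = NormalDist(2.0, 1.0), NormalDist(0.0, 1.0)

def moment_estimate(xs, xi, eta):
    """Solve xi*theta + eta*(1 - theta) = sample mean for theta."""
    xbar = sum(xs) / len(xs)
    theta = (xbar - eta) / (xi - eta)
    # Clipping keeps the mixture density positive in the Newton step below;
    # it is a practical safeguard added for this sketch.
    return min(max(theta, 0.01), 0.99)

def one_step(xs, theta0):
    """One Newton step theta0 - l'(theta0) / l''(theta0) for the mixture log likelihood."""
    score, hess = 0.0, 0.0
    for x in xs:
        gx, hx = G.pdf(x), H.pdf(x)
        ratio = (gx - hx) / (theta0 * gx + (1.0 - theta0) * hx)
        score += ratio          # derivative of log(theta*g + (1 - theta)*h)
        hess -= ratio * ratio   # second derivative, always <= 0 (concavity)
    return theta0 - score / hess

random.seed(2)
theta_true = 0.3
xs = [random.gauss(2.0, 1.0) if random.random() < theta_true else random.gauss(0.0, 1.0)
      for _ in range(5000)]
theta_tilde = moment_estimate(xs, xi=2.0, eta=0.0)
delta = one_step(xs, theta_tilde)
print(round(theta_tilde, 3), round(delta, 3))
```

The second derivative is a negative sum of squares, which is the concavity property noted in the example and guarantees that the Newton step is well defined whenever $g$ and $h$ differ on the sample.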


In the context of choosing a $\sqrt{n}$-consistent estimator $\tilde\theta_n$ for (4.8), it is of interest
to note that in sufficiently regular situations good efficiency of $\tilde\theta_n$ is equivalent to
high correlation with $\hat\theta_n$. This is made precise by the following result, which is
concerned only with first-order approximations.

Theorem 4.8 Suppose $\hat\theta_n$ is an ELE and $\tilde\theta_n$ a $\sqrt{n}$-consistent estimator,
for which the joint distribution of

$T_n = \sqrt{n}(\hat\theta_n - \theta)$ and $T'_n = \sqrt{n}(\tilde\theta_n - \theta)$

tends to a bivariate limit distribution $H$ with zero means and covariance matrix
$\Sigma = \|\sigma_{ij}\|$. Let $(T, T')$ have distribution $H$ and suppose that the means and
covariance matrix of $(T_n, T'_n)$ tend toward those of $(T, T')$ as $n \to \infty$. Then,

(4.18)    $\frac{\operatorname{var} T}{\operatorname{var} T'} = \rho^2$

where $\rho = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}$ is the correlation coefficient of $(T, T')$.

Proof. Consider $\operatorname{var}[(1 - \alpha)T_n + \alpha T'_n]$, which tends to

(4.19)    $\operatorname{var}[(1 - \alpha)T + \alpha T'] = (1 - \alpha)^2\sigma_{11} + 2\alpha(1 - \alpha)\sigma_{12} + \alpha^2\sigma_{22}.$

This is non-negative for all values of $\alpha$ and takes on its minimum at $\alpha = 0$ since $\hat\theta_n$
is asymptotically efficient. Evaluating the derivative of (4.19) at $\alpha = 0$ shows that
we must have $\sigma_{11} = \sigma_{12}$ (Problem 4.7). Thus, $\rho = \sqrt{\sigma_{11}/\sigma_{22}}$, as was to be proved.

□ 
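
The derivative step in the proof of Theorem 4.8 (the content of Problem 4.7) can be written out as follows. This is only a sketch, using the quantities $\sigma_{11}, \sigma_{12}, \sigma_{22}$ and the mixing weight $\alpha$ of (4.19); the minimum at the interior point $\alpha = 0$ forces the derivative to vanish there.

```latex
\begin{align*}
\frac{d}{d\alpha}\Big[(1-\alpha)^2\sigma_{11} + 2\alpha(1-\alpha)\sigma_{12} + \alpha^2\sigma_{22}\Big]
  &= -2(1-\alpha)\sigma_{11} + 2(1-2\alpha)\sigma_{12} + 2\alpha\sigma_{22}, \\
\text{at } \alpha = 0:\qquad -2\sigma_{11} + 2\sigma_{12} &= 0
  \;\Longrightarrow\; \sigma_{12} = \sigma_{11}, \\
\rho = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}
  = \frac{\sigma_{11}}{\sqrt{\sigma_{11}\sigma_{22}}}
  &= \sqrt{\frac{\sigma_{11}}{\sigma_{22}}},
  \qquad
  \rho^2 = \frac{\sigma_{11}}{\sigma_{22}} = \frac{\operatorname{var} T}{\operatorname{var} T'} .
\end{align*}
```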


The ratio of the asymptotic variances in (4.18) is a special case of asymptotic 
relative efficiency (ARE). See Definition 6.6. 

In Examples 4.6 and 4.7, we used the method of moments to obtain
$\sqrt{n}$-consistent estimators and then applied the one-step estimator (4.8) or (4.11). An
alternative approach, when the direct calculation of an ELE is difficult, is the
following expectation-maximization (EM) algorithm for obtaining a stationary point
of the likelihood.

The idea behind the EM algorithm is to replace one computationally difficult
likelihood maximization with a sequence of easier maximizations whose limit is the
answer to the original problem. More precisely, let $Y_1, \ldots, Y_n$ be iid with density
$g(y|\theta)$, and suppose that the object is to compute the value $\hat\theta$ that maximizes
$L(\theta|\mathbf{y}) = \prod_{i=1}^{n} g(y_i|\theta)$. If $L(\theta|\mathbf{y})$ is difficult to work with, we can sometimes
augment the data $\mathbf{y} = (y_1, \ldots, y_n)$ and create a new likelihood function $L(\theta|\mathbf{y}, \mathbf{z})$
that has a simpler form.

Example 4.9 Censored data likelihood. Suppose that we observe $Y_1, \ldots, Y_n$,
iid, with density (4.16), and we have ordered the observations so that $(y_1, \ldots, y_m)$
are uncensored and $(y_{m+1}, \ldots, y_n)$ are censored (and equal to $a$). The likelihood
function is then

(4.20)    $L(\theta|\mathbf{y}) = \prod_{i=1}^{n} g(y_i|\theta) = \prod_{i=1}^{m} f(y_i|\theta)\,[1 - F(a - \theta)]^{n-m}.$

If we had observed the last $n - m$ values, say $\mathbf{z} = (z_{m+1}, \ldots, z_n)$, the likelihood
would have had the simpler form

$L(\theta|\mathbf{y}, \mathbf{z}) = \prod_{i=1}^{m} f(y_i|\theta) \prod_{i=m+1}^{n} f(z_i|\theta).$

More generally, the EM algorithm is useful when the density of interest, $g(y_i|\theta)$,
can be expressed as

(4.21)    $g(\mathbf{y}|\theta) = \int f(\mathbf{y}, \mathbf{z}|\theta)\,d\mathbf{z},$

for some simpler function $f(\mathbf{y}, \mathbf{z}|\theta)$. The $\mathbf{z}$ vector merely serves to simplify
calculations, and its choice does not affect the value of the estimator. An illustration
of a typical construction of the density $f$ is the case of "filling in" missing data,
for example, by turning an unbalanced data set into a balanced one.

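Example 4.10 EM in a one-way layout. In a one-way layout (Example 3.4.9),
suppose there are four treatments with the following data

                 Treatments
        1         2         3         4
     $y_{11}$    $y_{12}$    $y_{13}$    $y_{14}$
     $y_{21}$    $y_{22}$    $y_{23}$    $y_{24}$
     $z_1$       $y_{32}$    $z_3$       $y_{34}$

where the $y_{ij}$'s represent the observed data, and the dummy variables $z_1$ and $z_3$
represent missing observations. Under the usual assumptions, the $Y_{ij}$'s are
independently normally distributed as $N(\mu + \alpha_j, \sigma^2)$, where $j$ indexes the treatment
(column) and $i$ the observation within it. If we let $\theta = (\mu, \alpha_1, \ldots, \alpha_4, \sigma^2)$
and let $n_j$ denote the number of observations on treatment $j$, the incomplete-data
likelihood is given by

$L(\theta|\mathbf{y}) = g(\mathbf{y}|\theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{10} \exp\left\{-\frac{1}{2}\sum_{j=1}^{4}\sum_{i=1}^{n_j}(y_{ij} - \mu - \alpha_j)^2/\sigma^2\right\}$

while the complete-data likelihood is

$L(\theta|\mathbf{y}, \mathbf{z}) = f(\mathbf{y}, \mathbf{z}|\theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{12} \exp\left\{-\frac{1}{2}\sum_{j=1}^{4}\sum_{i=1}^{3}(y_{ij} - \mu - \alpha_j)^2/\sigma^2\right\}$

where $y_{31} = z_1$ and $y_{33} = z_3$. By integrating out $z_1$ and $z_3$, the original likelihood
is recovered.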

Although estimation in the original problem (with only the $y_{ij}$'s) is not difficult,
it is easier in the augmented problem. [The computational advantage of the EM
algorithm becomes more obvious as we move to higher-order designs, for example,
the two-way layout (see Problem 4.14).] ||

The EM algorithm is often useful for obtaining an MLE when, as in Example
4.10, we should like to maximize $L(\theta|\mathbf{y})$, but it would be much easier, if certain
additional observations $\mathbf{z}$ were available, to work with the joint density
$f(\mathbf{y}, \mathbf{z}|\theta) = L(\theta|\mathbf{y}, \mathbf{z})$ and the conditional density of $\mathbf{Z}$ given $\mathbf{y}$, that is,

(4.22)    $L(\theta|\mathbf{y}, \mathbf{z})$ and $k(\mathbf{z}|\theta, \mathbf{y}) = \frac{f(\mathbf{y}, \mathbf{z}|\theta)}{g(\mathbf{y}|\theta)}.$

These quantities are related by the identity

(4.23)    $\log L(\theta|\mathbf{y}) = \log L(\theta|\mathbf{y}, \mathbf{z}) - \log k(\mathbf{z}|\theta, \mathbf{y}).$

Since $\mathbf{z}$ is not available, we replace the right side of (4.23) with its expectation,
using the conditional distribution of $\mathbf{Z}$ given $\mathbf{y}$. With an initial guess $\theta_0$ (to start
the iterations), we define

(4.24)    $Q(\theta|\theta_0, \mathbf{y}) = \int \log L(\theta|\mathbf{y}, \mathbf{z})\,k(\mathbf{z}|\theta_0, \mathbf{y})\,d\mathbf{z},$

          $H(\theta|\theta_0, \mathbf{y}) = \int \log k(\mathbf{z}|\theta, \mathbf{y})\,k(\mathbf{z}|\theta_0, \mathbf{y})\,d\mathbf{z}.$

As the left side of (4.23) does not depend on $\mathbf{z}$, the expected value of $\log L(\theta|\mathbf{y})$ is
then given by

(4.25)    $\log L(\theta|\mathbf{y}) = Q(\theta|\theta_0, \mathbf{y}) - H(\theta|\theta_0, \mathbf{y}).$

Let the value of $\theta$ maximizing $Q(\theta|\theta_0, \mathbf{y})$ be $\hat\theta_{(1)}$. The process is then repeated
with $\theta_0$ in (4.24) and (4.22) replaced by the updated value $\hat\theta_{(1)}$, so that (4.24)
is replaced by $Q(\theta|\hat\theta_{(1)}, \mathbf{y})$. In this manner, a sequence of estimators $\hat\theta_{(j)}$,
$j = 1, 2, \ldots$, is obtained iteratively, where $\hat\theta_{(j)}$ is defined as the value of $\theta$
maximizing $Q(\theta|\hat\theta_{(j-1)}, \mathbf{y})$, that is,

(4.26)    $Q(\hat\theta_{(j)}|\hat\theta_{(j-1)}, \mathbf{y}) = \max_\theta Q(\theta|\hat\theta_{(j-1)}, \mathbf{y}).$

(It is sometimes written $\hat\theta_{(j)} = \operatorname{argmax}_\theta Q(\theta|\hat\theta_{(j-1)}, \mathbf{y})$, that is, $\hat\theta_{(j)}$ is the
value of the argument $\theta$ that maximizes $Q$.)

The quantities $\log L(\theta|\mathbf{y})$, $\log L(\theta|\mathbf{y}, \mathbf{z})$, and $Q(\theta|\theta_0, \mathbf{y})$ are referred to as the
incomplete, complete, and expected log likelihood. The term EM for this algorithm
stands for Expectation-Maximization since the $j$th step of the iteration consists
of calculating the expectation (4.24), with $\theta_0$ replaced by $\hat\theta_{(j-1)}$, and then
maximizing it.

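The following is a key property of the sequence $\{\hat\theta_{(j)}\}$.

Theorem 4.11 The sequence $\{\hat\theta_{(j)}\}$ defined by (4.26) satisfies

(4.27)    $L(\hat\theta_{(j+1)}|\mathbf{y}) \ge L(\hat\theta_{(j)}|\mathbf{y}),$

with equality holding if and only if $Q(\hat\theta_{(j+1)}|\hat\theta_{(j)}, \mathbf{y}) = Q(\hat\theta_{(j)}|\hat\theta_{(j)}, \mathbf{y})$.

Proof. By (4.25), on successive iterations the difference between the logarithms
of the left and right sides of (4.27) is

(4.28)    $\log L(\hat\theta_{(j+1)}|\mathbf{y}) - \log L(\hat\theta_{(j)}|\mathbf{y})$
          $= \left[Q(\hat\theta_{(j+1)}|\hat\theta_{(j)}, \mathbf{y}) - Q(\hat\theta_{(j)}|\hat\theta_{(j)}, \mathbf{y})\right] - \left[H(\hat\theta_{(j+1)}|\hat\theta_{(j)}, \mathbf{y}) - H(\hat\theta_{(j)}|\hat\theta_{(j)}, \mathbf{y})\right].$

The first expression in (4.28) is non-negative by definition of $\hat\theta_{(j+1)}$. It remains to
show that the second term, taken with its minus sign, is also non-negative, that is,

(4.29)    $\int \left[\log k(\mathbf{z}|\hat\theta_{(j+1)}, \mathbf{y}) - \log k(\mathbf{z}|\hat\theta_{(j)}, \mathbf{y})\right] k(\mathbf{z}|\hat\theta_{(j)}, \mathbf{y})\,d\mathbf{z} \le 0.$

Since the difference of the logarithms is the logarithm of the ratio, this integral can
be written as

(4.30)    $\int \log\left[\frac{k(\mathbf{z}|\hat\theta_{(j+1)}, \mathbf{y})}{k(\mathbf{z}|\hat\theta_{(j)}, \mathbf{y})}\right] k(\mathbf{z}|\hat\theta_{(j)}, \mathbf{y})\,d\mathbf{z} \le \log \int k(\mathbf{z}|\hat\theta_{(j+1)}, \mathbf{y})\,d\mathbf{z} = 0.$

The inequality follows from Jensen's inequality (see Example 1.7.7, Inequality
(1.7.13), and Problem 4.17), and this completes the proof. □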


Although Theorem 4.11 guarantees that the likelihood will increase at each
iteration, we still may not be able to conclude that the sequence $\{\hat\theta_{(j)}\}$ converges
to a maximum likelihood estimator.

To ensure convergence, we require further conditions on the mapping
$\hat\theta_{(j)} \to \hat\theta_{(j+1)}$. These conditions are investigated by Boyles (1983) and Wu (1983);
see also Finch et al. 1989. The following theorem is, perhaps, the most easily
applicable condition to guarantee convergence to a stationary point, which may be a
local maximum or saddlepoint.


Theorem 4.12 If the expected complete-data likelihood $Q(\theta|\theta_0, \mathbf{y})$ is continuous
in both $\theta$ and $\theta_0$, then all limit points of an EM sequence $\{\hat\theta_{(j)}\}$ are stationary points
of $L(\theta|\mathbf{y})$, and $L(\hat\theta_{(j)}|\mathbf{y})$ converges monotonically to $L(\hat\theta|\mathbf{y})$ for some stationary
point $\hat\theta$.


Example 4.13 Continuation of Example 4.9. The situation of Example 4.9 does
not quite fit the conditions under which the EM algorithm was described above
since the observations $y_{m+1}, \ldots, y_n$ are not missing completely but only partially.
(We know that they are $\ge a$.) However, the situation reduces to the earlier one if we
just ignore $y_{m+1}, \ldots, y_n$, so that $\mathbf{y}$ now stands for $(y_1, \ldots, y_m)$. To be specific, let
the density $f(y|\theta)$ of (4.16) be the $N(\theta, 1)$ density, so that the likelihood function
(4.20) is

$L(\theta|\mathbf{y}) = \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{m}(y_i - \theta)^2}.$

We replace $y_{m+1}, \ldots, y_n$ with $n - m$ phantom variables $\mathbf{z} = (z_1, \ldots, z_{n-m})$ which
are distributed as $n - m$ iid variables from the conditional normal distribution given
that they are all $\ge a$; thus, for $z_i \ge a$, $i = 1, \ldots, n - m$,

$k(\mathbf{z}|\theta, \mathbf{y}) = \frac{1}{(\sqrt{2\pi})^{n-m}} \cdot \frac{\exp\left\{-\frac{1}{2}\sum_{i=1}^{n-m}(z_i - \theta)^2\right\}}{[1 - \Phi(a - \theta)]^{n-m}}.$

At the $j$th step in the EM sequence, we have

$Q(\theta|\hat\theta_{(j)}, \mathbf{y}) \propto -\frac{1}{2}\sum_{i=1}^{m}(y_i - \theta)^2 - \frac{1}{2}\sum_{i=1}^{n-m}\int_a^\infty (z_i - \theta)^2\, k(\mathbf{z}|\hat\theta_{(j)}, \mathbf{y})\,dz_i,$

and differentiating with respect to $\theta$ yields

$m(\bar y - \theta) + (n - m)\left[E(Z|\hat\theta_{(j)}) - \theta\right] = 0$

or

$\hat\theta_{(j+1)} = \frac{m\bar y + (n - m)E(Z|\hat\theta_{(j)})}{n}$

where

$E(Z|\hat\theta_{(j)}) = \int_a^\infty z\,k(z|\hat\theta_{(j)}, \mathbf{y})\,dz = \hat\theta_{(j)} + \frac{\phi(a - \hat\theta_{(j)})}{1 - \Phi(a - \hat\theta_{(j)})}.$

Thus, the EM sequence is defined by

$\hat\theta_{(j+1)} = \frac{m}{n}\bar y + \frac{n - m}{n}\hat\theta_{(j)} + \frac{n - m}{n} \cdot \frac{\phi(a - \hat\theta_{(j)})}{1 - \Phi(a - \hat\theta_{(j)})},$

which converges to the MLE $\hat\theta$ (Problem 4.8).
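
The EM recursion of Example 4.13 is simple enough to run directly. The sketch below simulates $N(\theta, 1)$ data censored at $a$, iterates the update, and records the censored-data log likelihood (4.20) up to an additive constant so that the monotonicity asserted in Theorem 4.11 can be checked. The function names, the simulated data, and the starting value $\bar y$ are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal: pdf is phi, cdf is Phi

def em_step(theta, ybar, m, n, a):
    """One EM update for N(theta, 1) data right-censored at a."""
    hazard = STD.pdf(a - theta) / (1.0 - STD.cdf(a - theta))
    return (m / n) * ybar + ((n - m) / n) * (theta + hazard)

def log_lik(theta, ys, n, a):
    """Censored-data log likelihood (4.20), dropping the constant -(m/2) log(2*pi)."""
    m = len(ys)
    return (-0.5 * sum((y - theta) ** 2 for y in ys)
            + (n - m) * math.log(1.0 - STD.cdf(a - theta)))

random.seed(3)
theta_true, a, n = 1.0, 1.5, 2000
xs = [random.gauss(theta_true, 1.0) for _ in range(n)]
ys = [x for x in xs if x < a]          # uncensored observations
m = len(ys)
ybar = sum(ys) / m

theta, path = ybar, []                 # start from the mean of the uncensored values
for _ in range(200):
    path.append(log_lik(theta, ys, n, a))
    new_theta = em_step(theta, ybar, m, n, a)
    converged = abs(new_theta - theta) < 1e-12
    theta = new_theta
    if converged:
        break

# At a fixed point the likelihood equation holds:
# m*(ybar - theta) + (n - m)*phi(a - theta)/(1 - Phi(a - theta)) = 0.
score = m * (ybar - theta) + (n - m) * STD.pdf(a - theta) / (1.0 - STD.cdf(a - theta))
print(round(theta, 4), len(path))
```

Starting from $\bar y$, which ignores the censored values and is therefore biased downward, each iteration moves the estimate upward by imputing the conditional mean of the censored observations.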


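Quite generally, in an exponential family, computations are somewhat simplified
because we can write

$Q(\theta|\hat\theta_{(j)}, \mathbf{y}) = E_{\hat\theta_{(j)}}\left[\log L(\theta|\mathbf{y}, \mathbf{Z})\,|\,\mathbf{y}\right]$

$= E_{\hat\theta_{(j)}}\left[\log\left(h(\mathbf{y}, \mathbf{Z})\,e^{\sum \eta_i(\theta)T_i - B(\theta)}\right)\Big|\,\mathbf{y}\right]$

$= E_{\hat\theta_{(j)}}\left[\log h(\mathbf{y}, \mathbf{Z})\,|\,\mathbf{y}\right] + \sum \eta_i(\theta)\,E_{\hat\theta_{(j)}}[T_i|\mathbf{y}] - B(\theta).$

Thus, calculating the complete-data MLE only involves the simpler expectation
$E_{\hat\theta_{(j)}}[T_i|\mathbf{y}]$.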

The books by Little and Rubin (1987), Tanner (1996), and McLachlan and 
Krishnan (1997) provide good overviews of the EM literature. Other references 
include Louis (1982), Laird et al. (1987), Meng and Rubin (1993), Smith and 
Roberts (1993), and Liu and Rubin (1994). 


5 The Multiparameter Case 

In the preceding sections, asymptotically efficient estimators were obtained when
the distribution depends on a single parameter $\theta$. When extending this theory to
probability models involving several parameters $\theta_1, \ldots, \theta_s$, one may be interested
either in the simultaneous estimation of these parameters (or certain functions
of them) or in the estimation of one of the parameters at a time, the remaining
parameters then playing the role of nuisance or incidental parameters. In the present
section, we shall primarily take the latter point of view.

Let $X_1, \ldots, X_n$ be iid with a distribution that depends on $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_s)$ and
satisfies assumptions (A0)-(A3) of Section 6.3. For the time being, we shall assume
$s$ to be fixed. Suppose we wish to estimate $\theta_j$. Then, it was seen in Section 2.6 that
the variance of any unbiased estimator $\delta_n$ of $\theta_j$, based on $n$ observations, satisfies
the inequality

(5.1)    $\operatorname{var}(\delta_n) \ge [I(\boldsymbol{\theta})]^{-1}_{jj}/n$

where the numerator on the right side is the $jj$th element of the inverse of the
information matrix $I(\boldsymbol{\theta})$ with elements $I_{jk}(\boldsymbol{\theta})$, $j, k = 1, \ldots, s$, defined by

(5.2)    $I_{jk}(\boldsymbol{\theta}) = \operatorname{cov}\left[\frac{\partial}{\partial\theta_j}\log f(X|\boldsymbol{\theta}),\ \frac{\partial}{\partial\theta_k}\log f(X|\boldsymbol{\theta})\right].$

It was further shown by Bahadur (1964) under conditions analogous to those of
Theorem 2.6 that for any sequence of estimators $\delta_n$ of $\theta_j$ satisfying

(5.3)    $\sqrt{n}(\delta_n - \theta_j) \xrightarrow{\mathcal{L}} N[0, v(\boldsymbol{\theta})],$

the asymptotic variance $v$ satisfies

(5.4)    $v(\boldsymbol{\theta}) \ge [I(\boldsymbol{\theta})]^{-1}_{jj},$

except on a set of values $\boldsymbol{\theta}$ having measure zero.

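We shall now show under assumptions generalizing those of Theorem 3.10
that with probability tending to 1, there exist solutions $\hat{\boldsymbol{\theta}}_n = (\hat\theta_{1n}, \ldots, \hat\theta_{sn})$ of the
likelihood equations

(5.5)    $\frac{\partial}{\partial\theta_j}\left[f(x_1|\boldsymbol{\theta}) \cdots f(x_n|\boldsymbol{\theta})\right] = 0, \quad j = 1, \ldots, s,$

or, equivalently,

(5.6)    $\frac{\partial}{\partial\theta_j}\left[l(\boldsymbol{\theta})\right] = 0, \quad j = 1, \ldots, s,$

such that $\hat\theta_{jn}$ is consistent for estimating $\theta_j$ and asymptotically efficient in the
sense of satisfying (5.3) with

(5.7)    $v(\boldsymbol{\theta}) = [I(\boldsymbol{\theta})]^{-1}_{jj}.$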

We state first some assumptions: 

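(A) There exists an open subset $\omega$ of $\Omega$ containing the true parameter point $\boldsymbol{\theta}^0$
such that for almost all $x$, the density $f(x|\boldsymbol{\theta})$ admits all third derivatives
$(\partial^3/\partial\theta_j\partial\theta_k\partial\theta_l) f(x|\boldsymbol{\theta})$ for all $\boldsymbol{\theta} \in \omega$.

(B) The first and second logarithmic derivatives of $f$ satisfy the equations

(5.8)    $E_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta_j}\log f(X|\boldsymbol{\theta})\right] = 0 \quad \text{for } j = 1, \ldots, s$

and

(5.9)    $I_{jk}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta_j}\log f(X|\boldsymbol{\theta}) \cdot \frac{\partial}{\partial\theta_k}\log f(X|\boldsymbol{\theta})\right] = E_{\boldsymbol{\theta}}\left[-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f(X|\boldsymbol{\theta})\right].$

Clearly, (5.8) and (5.9) imply (5.2).

(C) Since the $s \times s$ matrix $I(\boldsymbol{\theta})$ is a covariance matrix, it is positive semidefinite.
In generalization of condition (v) of Theorem 2.6, we shall assume that the
$I_{jk}(\boldsymbol{\theta})$ are finite and that the matrix $I(\boldsymbol{\theta})$ is positive definite for all $\boldsymbol{\theta}$ in $\omega$, and
hence that the $s$ statistics

$\frac{\partial}{\partial\theta_1}\log f(X|\boldsymbol{\theta}), \ldots, \frac{\partial}{\partial\theta_s}\log f(X|\boldsymbol{\theta})$

are affinely independent with probability 1.

(D) Finally, we shall suppose that there exist functions $M_{jkl}$ such that

$\left|\frac{\partial^3}{\partial\theta_j\partial\theta_k\partial\theta_l}\log f(x|\boldsymbol{\theta})\right| \le M_{jkl}(x) \quad \text{for all } \boldsymbol{\theta} \in \omega$

where

$m_{jkl} = E_{\boldsymbol{\theta}^0}[M_{jkl}(X)] < \infty \quad \text{for all } j, k, l.$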

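Theorem 5.1 Let $X_1, \ldots, X_n$ be iid, each with a density $f(x|\boldsymbol{\theta})$ (with respect to
$\mu$) which satisfies (A0)-(A2) of Section 6.3 and assumptions (A)-(D) above. Then,
with probability tending to 1 as $n \to \infty$, there exist solutions $\hat{\boldsymbol{\theta}}_n = \hat{\boldsymbol{\theta}}_n(X_1, \ldots, X_n)$
of the likelihood equations such that

(a) $\hat\theta_{jn}$ is consistent for estimating $\theta_j$,

(b) $\sqrt{n}(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta})$ is asymptotically normal with (vector) mean zero and covariance
matrix $[I(\boldsymbol{\theta})]^{-1}$, and

(c) $\hat\theta_{jn}$ is asymptotically efficient in the sense that

(5.10)    $\sqrt{n}(\hat\theta_{jn} - \theta_j) \xrightarrow{\mathcal{L}} N\{0, [I(\boldsymbol{\theta})]^{-1}_{jj}\}.$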

Proof. (a) Existence and Consistency. To prove the existence, with probability
tending to 1, of a sequence of solutions of the likelihood equations which is
consistent, we shall consider the behavior of the log likelihood $l(\boldsymbol{\theta})$ on the sphere $Q_a$
with center at the true point $\boldsymbol{\theta}^0$ and radius $a$. We will show that for any sufficiently
small $a$, the probability tends to 1 that

$l(\boldsymbol{\theta}) < l(\boldsymbol{\theta}^0)$

at all points $\boldsymbol{\theta}$ on the surface of $Q_a$, and hence that $l(\boldsymbol{\theta})$ has a local maximum
in the interior of $Q_a$. Since at a local maximum the likelihood equations must
be satisfied, it will follow that for any $a > 0$, with probability tending to 1 as
$n \to \infty$, the likelihood equations have a solution $\hat{\boldsymbol{\theta}}_n(a)$ within $Q_a$, and the proof
can be completed as in the one-dimensional case.

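To obtain the needed facts concerning the behavior of the likelihood on $Q_a$ for
small $a$, we expand the log likelihood about the true point $\boldsymbol{\theta}^0$ and divide by $n$ to
find

$\frac{1}{n}l(\boldsymbol{\theta}) - \frac{1}{n}l(\boldsymbol{\theta}^0)$

$= \frac{1}{n}\sum_j A_j(\mathbf{x})(\theta_j - \theta_j^0) + \frac{1}{2n}\sum_j\sum_k B_{jk}(\mathbf{x})(\theta_j - \theta_j^0)(\theta_k - \theta_k^0)$

$\quad + \frac{1}{6n}\sum_j\sum_k\sum_l (\theta_j - \theta_j^0)(\theta_k - \theta_k^0)(\theta_l - \theta_l^0)\sum_{i=1}^{n}\gamma_{jkl}(x_i)M_{jkl}(x_i)$

$= S_1 + S_2 + S_3,$

where $A_j(\mathbf{x})$ and $B_{jk}(\mathbf{x})$ are the first and second partial derivatives of $l(\boldsymbol{\theta})$ evaluated
at $\boldsymbol{\theta}^0$, as displayed in (5.11) and (5.12), and where, by assumption (D),
$0 \le |\gamma_{jkl}(x)| \le 1$.

To prove that the maximum of this difference for $\boldsymbol{\theta}$ on $Q_a$ is negative with
probability tending to 1 if $a$ is sufficiently small, we will show that with high probability
the maximum of $S_2$ is negative while $S_1$ and $S_3$ are small compared to $S_2$. The
basic tools for showing this are the facts that by (5.8), (5.9), and the law of large
numbers,

(5.11)    $\frac{1}{n}A_j(\mathbf{X}) = \frac{1}{n}\frac{\partial}{\partial\theta_j}l(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \to 0$ in probability

and

(5.12)    $\frac{1}{n}B_{jk}(\mathbf{X}) = \frac{1}{n}\frac{\partial^2}{\partial\theta_j\partial\theta_k}l(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \to -I_{jk}(\boldsymbol{\theta}^0)$ in probability.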

Let us begin with $S_1$. On $Q_a$, we have

$|S_1| \le \frac{1}{n}\,a\sum_j |A_j(\mathbf{X})|.$

For any given $a$, it follows from (5.11) that $|A_j(\mathbf{X})|/n < a^2$ and hence that
$|S_1| < sa^3$ with probability tending to 1. Next, consider

(5.13)    $2S_2 = \sum_j\sum_k\left[-I_{jk}(\boldsymbol{\theta}^0)(\theta_j - \theta_j^0)(\theta_k - \theta_k^0)\right]$

          $\quad + \sum_j\sum_k\left\{\frac{1}{n}B_{jk}(\mathbf{X}) - \left[-I_{jk}(\boldsymbol{\theta}^0)\right]\right\}(\theta_j - \theta_j^0)(\theta_k - \theta_k^0).$

For the second term, it follows from an argument analogous to that for $S_1$ that its
absolute value is less than $s^2a^3$ with probability tending to 1. The first term is a
negative (nonrandom) quadratic form in the variables $(\theta_j - \theta_j^0)$. By an orthogonal
transformation, this can be reduced to diagonal form $\sum\lambda_i\zeta_i^2$ with $Q_a$ becoming
$\sum\zeta_i^2 = a^2$. Suppose that the $\lambda$'s, which are negative, are numbered so that
$\lambda_s \le \lambda_{s-1} \le \cdots \le \lambda_1 < 0$. Then, $\sum\lambda_i\zeta_i^2 \le \lambda_1\sum\zeta_i^2 = \lambda_1 a^2$. Combining the first
and second terms, we see that there exist $c > 0$ and $a_0 > 0$ such that for $a < a_0$,

$S_2 < -ca^2$

with probability tending to 1.

Finally, with probability tending to 1,

$\left|\frac{1}{n}\sum_i M_{jkl}(X_i)\right| < 2m_{jkl}$

and hence $|S_3| < ba^3$ on $Q_a$ where

$b = \frac{s^3}{3}\sum_j\sum_k\sum_l m_{jkl}.$

Combining the three inequalities, we see that

(5.14)    $\max(S_1 + S_2 + S_3) < -ca^2 + (b + s)a^3,$

which is less than zero if $a < c/(b + s)$, and this completes the proof of (a).

(b) and (c) Asymptotic Normality and Efficiency. This part of the proof is
basically the same as that of Theorem 3.10. However, the single equation derived
there from the expansion of $\hat\theta_n - \theta_0$ is now replaced by a system of $s$ equations
which must be solved for the differences $(\hat\theta_{jn} - \theta_j^0)$. This makes the details of the
argument somewhat more cumbersome. In preparation, it will be convenient to
consider quite generally a set of random linear equations in $s$ unknowns,

(5.15)    $\sum_{k=1}^{s} A_{jkn}Y_{kn} = T_{jn} \quad (j = 1, \ldots, s).$

□ 

Lemma 5.2 Let $(T_{1n}, \ldots, T_{sn})$ be a sequence of random vectors tending weakly to
$(T_1, \ldots, T_s)$ and suppose that for each fixed $j$ and $k$, $A_{jkn}$ is a sequence of random
variables tending in probability to constants $a_{jk}$ for which the matrix $A = \|a_{jk}\|$
is nonsingular. Let $B = \|b_{jk}\| = A^{-1}$. Then, if the distribution of $(T_1, \ldots, T_s)$ has
a density with respect to Lebesgue measure over $E_s$, the solutions $(Y_{1n}, \ldots, Y_{sn})$
of (5.15) tend in law to the solutions $(Y_1, \ldots, Y_s)$ of

(5.16)    $\sum_{k=1}^{s} a_{jk}Y_k = T_j \quad (j = 1, \ldots, s)$

given by

(5.17)    $Y_j = \sum_{k=1}^{s} b_{jk}T_k.$

Proof. With probability tending to 1, the matrices $\|A_{jkn}\|$ are nonsingular, and
by Theorem 1.8.19 (Problem 5.1), the elements of the inverse of $\|A_{jkn}\|$ tend
in probability to the elements of $B$. Therefore, by a slight extension of Theorem
1.8.10, the solutions of (5.15) have the same limit distribution as those of

(5.18)    $Y_{jn} = \sum_{k=1}^{s} b_{jk}T_{kn}.$

By applying Theorem 1.8.19 to the set $S$,

(5.19)    $\sum_k b_{1k}T_k \le y_1, \ldots, \sum_k b_{sk}T_k \le y_s,$





it is only necessary to show that the distribution of $(T_1, \ldots, T_s)$ assigns probability
zero to the boundary of (5.19). Since this boundary is contained in the union of
the hyperplanes $\sum_k b_{jk}T_k = y_j$, the result follows. □

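Proof of Parts (b) and (c) of Theorem 5.1. In generalization of the proof of
Theorem 3.10, expand $\partial l(\boldsymbol{\theta})/\partial\theta_j = l'_j(\boldsymbol{\theta})$ about $\boldsymbol{\theta}^0$ to obtain

(5.20)    $l'_j(\boldsymbol{\theta}) = l'_j(\boldsymbol{\theta}^0) + \sum_k(\theta_k - \theta_k^0)\,l''_{jk}(\boldsymbol{\theta}^0)$

          $\quad + \frac{1}{2}\sum_k\sum_l(\theta_k - \theta_k^0)(\theta_l - \theta_l^0)\,l'''_{jkl}(\boldsymbol{\theta}^*)$

where $l''_{jk}$ and $l'''_{jkl}$ denote the indicated second and third derivatives of $l$ and where
$\boldsymbol{\theta}^*$ is a point on the line segment connecting $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^0$. In this expansion, replace
$\boldsymbol{\theta}$ by a solution $\hat{\boldsymbol{\theta}}_n$ of the likelihood equations, which by part (a) of the theorem
can be assumed to exist with probability tending to 1 and to be consistent. The left
side of (5.20) is then zero and the resulting equations can be written as

(5.21)    $\sqrt{n}\sum_k(\hat\theta_k - \theta_k^0)\left[\frac{1}{n}l''_{jk}(\boldsymbol{\theta}^0) + \frac{1}{2n}\sum_l(\hat\theta_l - \theta_l^0)\,l'''_{jkl}(\boldsymbol{\theta}^*)\right] = -\frac{1}{\sqrt{n}}\,l'_j(\boldsymbol{\theta}^0).$

These have the form (5.15) with

(5.22)    $Y_{kn} = \sqrt{n}(\hat\theta_k - \theta_k^0),$

          $A_{jkn} = \frac{1}{n}l''_{jk}(\boldsymbol{\theta}^0) + \frac{1}{2n}\sum_l(\hat\theta_l - \theta_l^0)\,l'''_{jkl}(\boldsymbol{\theta}^*),$

          $T_{jn} = -\frac{1}{\sqrt{n}}\,l'_j(\boldsymbol{\theta}^0) = -\sqrt{n}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_j}\log f(X_i|\boldsymbol{\theta})\right]_{\boldsymbol{\theta}=\boldsymbol{\theta}^0}.$

Since $E_{\boldsymbol{\theta}^0}[(\partial/\partial\theta_j)\log f(X_i|\boldsymbol{\theta})] = 0$, the multivariate central limit theorem
(Theorem 1.8.21) shows that $(T_{1n}, \ldots, T_{sn})$ has a multivariate normal limit distribution
with mean zero and covariance matrix $I(\boldsymbol{\theta}^0)$.

On the other hand, it is easy to see, again in parallel to the proof of Theorem
3.10, that

(5.23)    $A_{jkn} \xrightarrow{P} a_{jk} = \frac{1}{n}E[l''_{jk}(\boldsymbol{\theta}^0)] = -I_{jk}(\boldsymbol{\theta}^0).$

The limit distribution of the Y's is therefore that of the solution $(Y_1, \ldots, Y_s)$ of the
equations

(5.24)    $\sum_{k=1}^{s} I_{jk}(\boldsymbol{\theta}^0)Y_k = T_j$

where $\mathbf{T} = (T_1, \ldots, T_s)$ is multivariate normal with mean zero and covariance
matrix $I(\boldsymbol{\theta}^0)$ (the change of sign is immaterial since $\mathbf{T}$ and $-\mathbf{T}$ have the same
distribution). It follows that the distribution of $\mathbf{Y}$ is that of

$[I(\boldsymbol{\theta}^0)]^{-1}\mathbf{T},$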

which is a multivariate normal distribution with zero mean and covariance matrix
$[I(\boldsymbol{\theta}^0)]^{-1}$. This completes the proof of asymptotic normality and efficiency. □

If the likelihood equations have a unique solution $\hat{\boldsymbol{\theta}}_n$, then $\hat{\boldsymbol{\theta}}_n$ is consistent,
asymptotically normal, and efficient. It is, however, interesting to note that even if
the parameter space is an open interval, it does not follow as in Corollary 3.8 that
the MLE exists and hence is consistent (Problem 5.6). Sufficient conditions for
existence and uniqueness are given in Makelainen, Schmidt, and Styan (1981).

As in the one-parameter case, if the solution of the likelihood equations is not
unique, Theorem 5.1 does not establish the existence of an efficient estimator of $\boldsymbol{\theta}$.
However, the methods mentioned in Section 2.5 also work in the present case. In
particular, if $\tilde{\boldsymbol{\theta}}_n$ is a consistent sequence of estimators of $\boldsymbol{\theta}$, then the solution $\hat{\boldsymbol{\theta}}_n$ of
the likelihood equations closest to $\tilde{\boldsymbol{\theta}}_n$, for example, in the sense that $\sum(\hat\theta_{jn} - \tilde\theta_{jn})^2$
is smallest, is asymptotically efficient.

More convenient, typically, is the approach of Theorem 4.3, which we now 
generalize to the multiparameter case. 

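Theorem 5.3 Suppose that the assumptions of Theorem 5.1 hold and that $\tilde\theta_{jn}$ is
a $\sqrt{n}$-consistent estimator of $\theta_j$ for $j = 1, \ldots, s$. Let $\{\delta_{kn}, k = 1, \ldots, s\}$ be the
solution of the equations

(5.25)    $\sum_{k=1}^{s}(\delta_{kn} - \tilde\theta_{kn})\,l''_{jk}(\tilde{\boldsymbol{\theta}}_n) = -l'_j(\tilde{\boldsymbol{\theta}}_n).$

Then, $\boldsymbol{\delta}_n = (\delta_{1n}, \ldots, \delta_{sn})$ satisfies (5.10) with $\delta_{jn}$ in place of $\hat\theta_{jn}$ and, thus, is
asymptotically efficient.

Proof. The proof is a simple combination of the proofs of Theorems 4.3 and 5.1,
and we shall only sketch it. Expanding the right side about $\boldsymbol{\theta}^0$ allows us to rewrite
(5.25) as

$\sum_k(\delta_{kn} - \tilde\theta_{kn})\,l''_{jk}(\tilde{\boldsymbol{\theta}}_n) = -l'_j(\boldsymbol{\theta}^0) - \sum_k(\tilde\theta_{kn} - \theta_k^0)\,l''_{jk}(\boldsymbol{\theta}^0) + R_n$

where

$R_n = -\frac{1}{2}\sum_k\sum_l(\tilde\theta_{kn} - \theta_k^0)(\tilde\theta_{ln} - \theta_l^0)\,l'''_{jkl}(\boldsymbol{\theta}^*_n)$

and hence as

(5.26)    $\sqrt{n}\sum_k(\delta_{kn} - \theta_k^0)\,\frac{1}{n}l''_{jk}(\tilde{\boldsymbol{\theta}}_n)$

          $= -\frac{1}{\sqrt{n}}\,l'_j(\boldsymbol{\theta}^0) + \sqrt{n}\sum_k(\tilde\theta_{kn} - \theta_k^0)\left[\frac{1}{n}l''_{jk}(\tilde{\boldsymbol{\theta}}_n) - \frac{1}{n}l''_{jk}(\boldsymbol{\theta}^0)\right] + \frac{1}{\sqrt{n}}R_n.$

This has the form (5.15), and it is easy to check (Problem 5.2) that the limits (in
probability) of the $A_{jkn}$ are the same $a_{jk}$ as in (5.23) and that the second and
third terms on the right side of (5.26) tend toward zero in probability. Thus, the
joint limit distribution of the right side is the same as that of the $T_{jn}$ given by (5.22). It
follows that the joint limit distribution of the $\sqrt{n}(\delta_{kn} - \theta_k^0)$ is the same as that of
the $\sqrt{n}(\hat\theta_{kn} - \theta_k^0)$ in Theorem 5.1, and this completes the proof. □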

The following result generalizes Corollary 4.4 to the multiparameter case.

Corollary 5.4 Suppose that the assumptions of Theorem 5.3 hold and that the
elements $I_{jk}(\boldsymbol{\theta})$ of the information matrix of the X's are continuous. Then, the
solutions $\delta'_{kn}$ of the equations

(5.27)    $n\sum_k(\delta'_{kn} - \tilde\theta_{kn})\,I_{jk}(\tilde{\boldsymbol{\theta}}_n) = l'_j(\tilde{\boldsymbol{\theta}}_n)$

are asymptotically efficient.

The proof is left to Problem 5.5. 
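
A one-step scoring estimator of the kind described in Corollary 5.4 can be sketched for the two-parameter Cauchy location-scale family. The sketch relies on the standard fact that for this family the information matrix is diagonal with both diagonal entries equal to $1/(2\sigma^2)$, where $\sigma$ is the scale. The starting values (sample median and half the interquartile range), the function names, and the simulated data are assumptions made for this illustration.

```python
import math
import random

def quantile(sorted_xs, p):
    """Order-statistic quantile without interpolation (adequate for a sketch)."""
    idx = min(len(sorted_xs) - 1, max(0, int(p * len(sorted_xs))))
    return sorted_xs[idx]

def cauchy_one_step(xs):
    """Return the starting values and the one-step scoring estimates of (location, scale)."""
    n = len(xs)
    s = sorted(xs)
    loc0 = quantile(s, 0.5)                                   # sample median
    scale0 = 0.5 * (quantile(s, 0.75) - quantile(s, 0.25))    # half interquartile range
    zs = [(x - loc0) / scale0 for x in xs]
    # Derivatives of the log likelihood at the starting values.
    score_loc = sum(2.0 * z / (1.0 + z * z) for z in zs) / scale0
    score_scale = sum((z * z - 1.0) / (1.0 + z * z) for z in zs) / scale0
    info = 1.0 / (2.0 * scale0 ** 2)      # common diagonal entry of the information matrix
    # Solve n * I * (delta - start) = score; I is diagonal, so the system decouples.
    loc1 = loc0 + score_loc / (n * info)
    scale1 = scale0 + score_scale / (n * info)
    return (loc0, scale0), (loc1, scale1)

random.seed(4)
true_loc, true_scale = 2.0, 3.0
xs = [true_loc + true_scale * math.tan(math.pi * (random.random() - 0.5))
      for _ in range(4000)]
start, step = cauchy_one_step(xs)
print([round(v, 3) for v in start], [round(v, 3) for v in step])
```

Because the information matrix is diagonal here, the linear system reduces to two scalar updates; for a non-diagonal matrix the same step requires solving an $s \times s$ system.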


6 Applications 


Maximum likelihood (together with some of its variants) is the most widely used 
method of estimation, and a list of its applications would cover practically the 
whole field of statistics. [For a survey with a comprehensive set of references, 
see Norden 1972-1973 or Scholz 1985.] In this section, we will discuss a few 
applications to illustrate some of the issues arising. The discussion, however, is 
not carried to the practical level, and in particular, the problem of choosing among 
alternative asymptotically efficient methods is not addressed. Such a choice must
be based not only on theoretical considerations but also on empirical evidence
on the performance of the estimators at various sample sizes. For any specific
example, the relevant literature should be consulted. 


Example 6.1 Weibull distribution. Let $X_1, \ldots, X_n$ be iid according to a
two-parameter Weibull distribution, whose density it is convenient to write in a
parameterization suggested by Cohen (1965b) as

(6.1)    $\frac{\gamma}{\beta}\,x^{\gamma-1}e^{-x^\gamma/\beta}, \quad x > 0,\ \beta > 0,\ \gamma > 0,$

where $\gamma$ is a shape parameter and $\beta^{1/\gamma}$ a scale parameter. The likelihood equations,
after some simplification, reduce to (Problem 6.1)

(6.2)    $h(\gamma) = \frac{\sum x_i^\gamma \log x_i}{\sum x_i^\gamma} - \frac{1}{\gamma} = \frac{1}{n}\sum \log x_i$

and

(6.3)    $\beta = \sum x_i^\gamma/n.$

To show that (6.2) has at most one solution, note that $h'(\gamma)$ exceeds the derivative
of the first term, which equals (Problem 6.2) $\sum a_i^2 p_i - (\sum a_i p_i)^2$ with

(6.4)    $a_i = \log x_i, \quad p_i = e^{\gamma a_i}\Big/\sum_j e^{\gamma a_j}.$

It follows that $h'(\gamma) > 0$ for all $\gamma > 0$. That (6.2) always has a solution follows
from (Problem 6.2):

(6.5)    $-\infty = \lim_{\gamma \to 0} h(\gamma) < \frac{1}{n}\sum \log x_i < \log x_{(n)} = \lim_{\gamma \to \infty} h(\gamma).$

This example, therefore, illustrates the simple situation in which the likelihood
equations always have a unique solution. ||
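
Since $h$ is strictly increasing with the limits given in (6.5), the root of (6.2) can be bracketed and located by bisection. The following is a minimal sketch under that observation; the function names, the bracketing scheme, the tolerance, and the simulated data are illustrative choices rather than a method prescribed by the text.

```python
import math
import random

def h(gamma, logs):
    """Left side of (6.2), computed from a_i = log x_i with weights proportional to x_i^gamma."""
    top = max(logs)
    weights = [math.exp(gamma * (a - top)) for a in logs]   # rescaled to avoid overflow
    weighted_mean = sum(w * a for w, a in zip(weights, logs)) / sum(weights)
    return weighted_mean - 1.0 / gamma

def weibull_mle(xs, rel_tol=1e-12):
    """Solve (6.2) for gamma by bisection, then obtain beta from (6.3)."""
    logs = [math.log(x) for x in xs]
    if max(logs) == min(logs):
        raise ValueError("need at least two distinct observations")
    target = sum(logs) / len(logs)
    lo, hi = 1e-8, 1.0
    while h(hi, logs) < target:      # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if h(mid, logs) < target:
            lo = mid
        else:
            hi = mid
    gamma_hat = 0.5 * (lo + hi)
    beta_hat = sum(x ** gamma_hat for x in xs) / len(xs)
    return gamma_hat, beta_hat

random.seed(5)
scale, shape = 2.0, 1.5              # beta in (6.1) corresponds to scale ** shape
xs = [random.weibullvariate(scale, shape) for _ in range(5000)]
gamma_hat, beta_hat = weibull_mle(xs)
print(round(gamma_hat, 3), round(beta_hat, 3))
```

Bisection is used here only because monotonicity of $h$ makes it fail-safe; a Newton iteration on (6.2) would converge faster once the root is bracketed.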


Example 6.2 Location-scale families. Let $X_1, \ldots, X_n$ be iid, each with density
$(1/a)f[(x - \xi)/a]$. The calculation of an ELE is easy when the likelihood equations
have a unique root $(\hat\xi, \hat a)$. It was shown by Barndorff-Nielsen and Blaesild (1980)
that a sufficient condition for this to be the case is that $f(x)$ is positive, twice
differentiable for all $x$, and strongly unimodal. Surprisingly, Copas (1975) showed
that it is unique also when $f$ is Cauchy, despite the fact that the Cauchy density
is not strongly unimodal and that in this case the likelihood equation can have
multiple roots when $a$ is known. Ferguson (1978) gave explicit formulas for the
Cauchy MLEs for $n = 3$ or 4. See also Haas et al. 1970 and McCullagh 1992.

In the presence of multiple roots, the simplest approach typically is that of
Theorem 5.3. The $\sqrt{n}$-consistent estimators of $\xi$ and $a$ required by this theorem
are easily obtained in the present case. As was pointed out in Example 4.5, the
mean or median of the X's will usually have the desired property for $\xi$. (When
$f$ is asymmetric, this requires that $\xi$ be specified to be some particular location
measure such as the mean or median of the distribution of the $X_i$.) If $E(X_i^4) < \infty$,
$\hat a_n = \sqrt{\sum(X_i - \bar X)^2/n}$ will be $\sqrt{n}$-consistent for $a$ if the latter is taken to be
the population standard deviation. If $E(X_i^4) = \infty$, one can instead, for example,
take a suitable multiple of the interquartile range $X_{(k)} - X_{(j)}$, where $k = [3n/4]$
and $j = [n/4]$ (see, for example, Mosteller 1946).

If $f$ satisfies the assumptions of Theorem 5.1, then $[\sqrt{n}(\hat\xi_n - \xi), \sqrt{n}(\hat a_n - a)]$
have a joint bivariate normal limit distribution with zero means and covariance matrix
$I^{-1}(a) = \|I_{ij}(a)\|^{-1}$, which is independent of $\xi$ and where $I(a)$ is given by (2.6.20)
and (2.6.21). ||


If the distribution of the $X_i$ depends on $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_s)$, it is interesting to
compare the estimation of $\theta_j$ when the other parameters are unknown with the
situation in which they are known. The mathematical meaning of this distinction
is that an estimator is permitted to depend on known parameters but not on
unknown ones. Since the class of possible estimators is thus more restricted when the
nuisance parameters are unknown, it follows from Theorems 3.10 and 5.1 that the
asymptotic variance of an efficient estimator when some of the $\theta$'s are unknown
can never fall below its value when they are known, so that

(6.6)    $\frac{1}{I_{jj}(\boldsymbol{\theta})} \le [I(\boldsymbol{\theta})]^{-1}_{jj},$

as was already shown in Section 2.6 as (2.6.25). There, it was also proved that
equality holds in (6.6) whenever

(6.7)    $\operatorname{cov}\left[\frac{\partial}{\partial\theta_j}\log f(X|\boldsymbol{\theta}),\ \frac{\partial}{\partial\theta_k}\log f(X|\boldsymbol{\theta})\right] = 0 \quad \text{for all } j \ne k,$

and that this condition, which states that

(6.8)    $I(\boldsymbol{\theta})$ is diagonal,

is also necessary for equality. For the location-scale families of Example 6.2, it
follows from (2.6.21) that $I_{12} = 0$ whenever $f$ is symmetric about zero but not
necessarily otherwise. For symmetric $f$, there is therefore no loss of asymptotic
efficiency in estimating $\xi$ or $a$ when the other parameter is unknown.
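
For $s = 2$ the inequality (6.6) and the role of the off-diagonal element can be verified directly. The following short derivation is only an illustration of the general result; it assumes that $I(\boldsymbol{\theta})$ is positive definite, so that $I_{11}I_{22} - I_{12}^2 > 0$.

```latex
\begin{align*}
[I(\boldsymbol{\theta})]^{-1}
  &= \frac{1}{I_{11}I_{22} - I_{12}^2}
     \begin{pmatrix} I_{22} & -I_{12} \\ -I_{12} & I_{11} \end{pmatrix}, \\
[I(\boldsymbol{\theta})]^{-1}_{11}
  &= \frac{I_{22}}{I_{11}I_{22} - I_{12}^2}
   = \frac{1}{I_{11}\left(1 - \dfrac{I_{12}^2}{I_{11}I_{22}}\right)}
   \;\ge\; \frac{1}{I_{11}},
\end{align*}
```

with equality if and only if $I_{12} = 0$, that is, if and only if the two parameters are orthogonal.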

Quite generally, if the off-diagonal elements of the information matrix are zero, 
the parameters are said to be orthogonal. Although it is not always possible to find 





an entire set of orthogonal parameters, it is always possible to obtain orthogonality 
between a scalar parameter of interest and the remaining (nuisance) parameters. 
See Cox and Reid 1987 and Problem 6.5. 

As another illustration of efficient likelihood estimation, consider a multiparam¬ 
eter exponential family. Here, UMVU estimators often are satisfactory solutions 
of the estimation problem. However, the estimand may not be U -estimable and 
then another approach is needed. In some cases, even when a UMVU estimator 
exists, the MLE has the advantage of not taking on values outside the range of the 
estimand. 

Example 6.3 Multiparameter exponential families. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be
distributed according to an $s$-parameter exponential family with density (1.5.2)
with respect to a $\sigma$-finite measure $\mu$, where $\mathbf{x}$ takes the place of $x$ and where it is
assumed that $T_1(\mathbf{X}), \ldots, T_s(\mathbf{X})$ are affinely independent with probability 1. Using
the fact that

(6.9)    $\frac{\partial}{\partial\eta_j}[l(\boldsymbol{\eta})] = -\frac{\partial}{\partial\eta_j}[A(\boldsymbol{\eta})] + T_j(\mathbf{x})$

and other properties of the densities (1.5.2), one sees that the conditions of Theorem
5.1 are satisfied when the X's are iid. By (1.5.14), the likelihood equations for the
$\eta$'s reduce to

(6.10)    $T_j(\mathbf{x}) = E_{\boldsymbol{\eta}}[T_j(\mathbf{X})].$

If these equations have a solution, it is unique (and is the MLE) since $l(\boldsymbol{\eta})$ is a
strictly concave function of $\boldsymbol{\eta}$. This follows from Theorem 1.7.13 and the fact that,
by (1.5.15),

(6.11)    $-\frac{\partial^2}{\partial\eta_j\partial\eta_k}[l(\boldsymbol{\eta})] = \frac{\partial^2}{\partial\eta_j\partial\eta_k}[A(\boldsymbol{\eta})] = \operatorname{cov}[T_j(\mathbf{X}), T_k(\mathbf{X})]$

and that, by assumption, the matrix with entries (6.11) is positive definite.

Sufficient conditions for the existence of a solution of the likelihood equations
are given by Crain (1976) and Barndorff-Nielsen (1978, Sections 9.3, 9.4), where
they are shown to be satisfied for the two-parameter gamma family of Table 1.5.1.

An alternative method for obtaining asymptotically efficient estimators for the
parameters of an exponential family is based on the mean-value parameterization
(2.6.17). Slightly changing the formulation of the model, consider a sample
$(X_1, \ldots, X_n)$ of size $n$ from the family (1.5.2), and let
$\bar T_j = [T_j(X_1) + \cdots + T_j(X_n)]/n$ and $\theta_j = E(T_j)$. By the CLT, the joint limit
distribution of the $\sqrt{n}(\bar T_j - \theta_j)$ is multivariate normal with zero means and
covariance matrix $\sigma_{ij} = \operatorname{cov}[T_i(X), T_j(X)]$.
This proves the $\bar T_j$ to be asymptotically efficient estimators by (2.6.18). ||

For further discussion of maximum likelihood estimation in exponential families,
see Berk 1972b, Sundberg 1974, 1976, Barndorff-Nielsen 1978, Johansen
1979, Brown 1986a, and Note 10.4.

In the next two examples, we shall consider in somewhat more detail the most 
important case of Example 6.3, the multivariate and, in particular, the bivariate 
normal distribution.

Example 6.4 Multivariate normal distribution. Suppose we let $(X_{1\nu}, \ldots, X_{p\nu})$,
$\nu = 1, \ldots, n$, be a sample from a nonsingular normal distribution with means
$E(X_{i\nu}) = \xi_i$ and covariances $\mathrm{cov}(X_{i\nu}, X_{j\nu}) = \sigma_{ij}$. By (1.4.15), the density of the
X's is given by

(6.12) $$|\Sigma|^{n/2} (2\pi)^{-pn/2} \exp\left(-\frac{1}{2} \sum_{j} \sum_{k} \eta_{jk} S_{jk}\right)$$

where

(6.13) $$S_{jk} = \sum_{\nu} (X_{j\nu} - \xi_j)(X_{k\nu} - \xi_k), \quad j, k = 1, \ldots, p,$$

and where $\Sigma = \|\eta_{jk}\|$ is the inverse of the covariance matrix $\|\sigma_{jk}\|$.

Consider, first, the case in which the $\xi$'s are known. Then, (6.12) is an exponential
family with $T_{jk} = -(1/2) S_{jk}$. If the matrix $\|\sigma_{jk}\|$ is nonsingular, the $T_{jk}$ are affinely
independent with probability 1, so that the result of the preceding example applies.
Since $E(S_{jk}) = n\sigma_{jk}$, the likelihood equations (6.10) reduce to $n\sigma_{jk} = S_{jk}$ and
thus have the solutions

(6.14) $$\hat{\sigma}_{jk} = \frac{1}{n} S_{jk}.$$

The sample moments and correlations are, therefore, ELEs of the population
variances, covariances, and correlation coefficients. Also, the $(jk)$th element of
$\|\hat{\sigma}_{jk}\|^{-1}$ is an asymptotically efficient estimator of $\eta_{jk}$. In addition to being the
MLE, $\hat{\sigma}_{jk}$ is the UMVU estimator of $\sigma_{jk}$ (Example 2.2.4).

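If the $\xi$'s are unknown,

$$\hat{\xi}_j = \frac{1}{n} \sum_{\nu} X_{j\nu} = X_{j\cdot},$$

and $\hat{\sigma}_{jk}$, given by (6.14) but with $S_{jk}$ now defined as

(6.15) $$S_{jk} = \sum_{\nu} (X_{j\nu} - X_{j\cdot})(X_{k\nu} - X_{k\cdot}),$$

continue to be ELEs for $\xi_j$ and $\sigma_{jk}$ (Problem 6.6).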
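
The estimators in (6.14) and (6.15) are simple to compute. The following is a minimal sketch, assuming the sample arrives as a list of $p$-vectors; the function name `mle_covariance` and the toy data are illustrative and not taken from the text. Note the divisor $n$ rather than $n - 1$.

```python
def mle_covariance(rows):
    """Return (xi_hat, sigma_hat) for a list of p-dimensional observations.

    xi_hat[j] is the sample mean of coordinate j, and
    sigma_hat[j][k] = S_jk / n with S_jk as in (6.15), i.e. divisor n.
    """
    n = len(rows)
    p = len(rows[0])
    xi_hat = [sum(row[j] for row in rows) / n for j in range(p)]
    sigma_hat = [
        [
            sum((row[j] - xi_hat[j]) * (row[k] - xi_hat[k]) for row in rows) / n
            for k in range(p)
        ]
        for j in range(p)
    ]
    return xi_hat, sigma_hat


# Toy data (hypothetical values): n = 4 observations of a p = 2 vector.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 4.0)]
xi_hat, sigma_hat = mle_covariance(data)
print(xi_hat)     # sample means of the two coordinates
print(sigma_hat)  # estimated covariance matrix
```

The sample correlation then follows as `sigma_hat[0][1] / (sigma_hat[0][0] * sigma_hat[1][1]) ** 0.5`.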

If $\xi$ is known, the asymptotic distribution of $S_{jk}$ given by (6.13) is immediate
from the central limit theorem since $S_{jk}$ is the sum of $n$ iid variables with
expectation

$$E(X_{j\nu} - \xi_j)(X_{k\nu} - \xi_k) = \sigma_{jk}$$

and variance

$$E[(X_{j\nu} - \xi_j)^2 (X_{k\nu} - \xi_k)^2] - \sigma_{jk}^2.$$

If $j \ne k$, it follows from Problem 1.5.26 that

$$E[(X_{j\nu} - \xi_j)^2 (X_{k\nu} - \xi_k)^2] = \sigma_{jj}\sigma_{kk} + 2\sigma_{jk}^2$$

so that

$$\mathrm{var}[(X_{j\nu} - \xi_j)(X_{k\nu} - \xi_k)] = \sigma_{jj}\sigma_{kk} + \sigma_{jk}^2$$

and

(6.16) $$\sqrt{n}\left(\frac{S_{jk}}{n} - \sigma_{jk}\right) \xrightarrow{d} N(0, \sigma_{jj}\sigma_{kk} + \sigma_{jk}^2).$$

If $\xi$ is unknown, the $S_{jk}$ given by (6.15) are independent of the $X_{i\cdot}$, and the
asymptotic distribution of (6.15) is the same as that of (6.13) (Problem 6.7). ||

Example 6.5 Bivariate normal distribution. In the preceding example, it was
seen that knowing the means does not affect the efficiency with which the covariances
can be estimated. Let us now restrict attention to the covariances and, for
the sake of simplicity, suppose that $p = 2$. With an obvious change of notation,
let $(X_i, Y_i)$, $i = 1, \ldots, n$, be iid, each with density (1.4.16). Since the asymptotic
distributions of $\hat{\sigma}$, $\hat{\tau}$, and $\hat{\rho}$ are not affected by whether or not $\xi$ and $\eta$ are known,
let us assume $\xi = \eta = 0$. For the information matrix $I(\theta)$ [where $\theta = (\sigma^2, \tau^2, \rho)$],
we find [Problem 6.8(a)]


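(6.17) $$(1 - \rho^2)\, I(\theta) = \begin{pmatrix} \dfrac{2 - \rho^2}{4\sigma^4} & \dfrac{-\rho^2}{4\sigma^2\tau^2} & \dfrac{-\rho}{2\sigma^2} \\[2ex] \dfrac{-\rho^2}{4\sigma^2\tau^2} & \dfrac{2 - \rho^2}{4\tau^4} & \dfrac{-\rho}{2\tau^2} \\[2ex] \dfrac{-\rho}{2\sigma^2} & \dfrac{-\rho}{2\tau^2} & \dfrac{1 + \rho^2}{1 - \rho^2} \end{pmatrix}.$$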


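Inversion of this matrix gives the covariance matrix of the $\sqrt{n}(\hat{\theta}_j - \theta_j)$ as
[Problem 6.8(b)]

(6.18) $$\begin{pmatrix} 2\sigma^4 & 2\rho^2\sigma^2\tau^2 & \rho(1 - \rho^2)\sigma^2 \\ 2\rho^2\sigma^2\tau^2 & 2\tau^4 & \rho(1 - \rho^2)\tau^2 \\ \rho(1 - \rho^2)\sigma^2 & \rho(1 - \rho^2)\tau^2 & (1 - \rho^2)^2 \end{pmatrix}.$$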


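Thus, we find that

(6.19) $$\sqrt{n}(\hat{\sigma}^2 - \sigma^2) \xrightarrow{d} N(0, 2\sigma^4), \quad \sqrt{n}(\hat{\tau}^2 - \tau^2) \xrightarrow{d} N(0, 2\tau^4), \quad \sqrt{n}(\hat{\rho} - \rho) \xrightarrow{d} N[0, (1 - \rho^2)^2].$$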
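
As a numerical sanity check on the inversion, the sketch below multiplies the information matrix of (6.17) by the covariance matrix of (6.18) and confirms that the product is the identity. The matrix entries are typed in as read from the two displays, and the parameter values are arbitrary illustrative choices; the helper names are not from the text.

```python
def info_matrix(s2, t2, rho):
    """I(theta) for theta = (sigma^2, tau^2, rho); s2 = sigma^2, t2 = tau^2."""
    d = 1.0 - rho ** 2
    m = [
        [(2 - rho ** 2) / (4 * s2 ** 2), -rho ** 2 / (4 * s2 * t2), -rho / (2 * s2)],
        [-rho ** 2 / (4 * s2 * t2), (2 - rho ** 2) / (4 * t2 ** 2), -rho / (2 * t2)],
        [-rho / (2 * s2), -rho / (2 * t2), (1 + rho ** 2) / d],
    ]
    # (6.17) displays (1 - rho^2) I(theta), so divide through by d.
    return [[entry / d for entry in row] for row in m]


def asymptotic_cov(s2, t2, rho):
    """Covariance matrix displayed in (6.18)."""
    d = 1.0 - rho ** 2
    return [
        [2 * s2 ** 2, 2 * rho ** 2 * s2 * t2, rho * d * s2],
        [2 * rho ** 2 * s2 * t2, 2 * t2 ** 2, rho * d * t2],
        [rho * d * s2, rho * d * t2, d ** 2],
    ]


def matmul(a, b):
    """Product of two 3 x 3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


product = matmul(info_matrix(2.0, 0.5, 0.6), asymptotic_cov(2.0, 0.5, 0.6))
for row in product:
    print([round(v, 10) for v in row])  # rows of the identity, up to rounding
```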

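On the other hand, if $\sigma$ and $\tau$ are known to be equal to 1, the MLE $\hat{\hat{\rho}}$ of $\rho$ satisfies
(Problem 6.9)

(6.20) $$\sqrt{n}(\hat{\hat{\rho}} - \rho) \xrightarrow{d} N\left(0, \frac{(1 - \rho^2)^2}{1 + \rho^2}\right),$$

whereas if $\rho$ and $\tau$ are known, the MLE $\hat{\hat{\sigma}}$ of $\sigma$ satisfies (Problem 6.10)

(6.21) $$\sqrt{n}(\hat{\hat{\sigma}}^2 - \sigma^2) \xrightarrow{d} N\left(0, \frac{4\sigma^4(1 - \rho^2)}{2 - \rho^2}\right).$$

A criterion for comparing $\hat{\hat{\rho}}$ to $\hat{\rho}$ is provided by the asymptotic relative efficiency.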

Definition 6.6 If the sequence of estimators $\delta_n$ of $g(\theta)$ satisfies
$\sqrt{n}[\delta_n - g(\theta)] \xrightarrow{d} N(0, \tau^2)$, and the sequence of estimators $\delta'_{n'}$, where $\delta'_{n'}$ is
based on $n' = n'(n)$ observations, also satisfies $\sqrt{n}[\delta'_{n'} - g(\theta)] \xrightarrow{d} N(0, \tau^2)$, then the
asymptotic relative efficiency (ARE) of $\{\delta_n\}$ with respect to $\{\delta'_n\}$ is

$$e_{\delta,\delta'} = \lim_{n \to \infty} \frac{n'(n)}{n},$$

provided the limit exists and is independent of the subsequences $n'$.

The interpretation is clear. Suppose, for example, that $e = 1/2$. Then, for large
values of $n$, $n'$ is approximately equal to $(1/2)n$. To obtain the same limit distribution
(and limit variance), half as many observations are therefore required with
$\delta'$ as with $\delta$. It is then reasonable to say that $\delta'$ is twice as efficient as $\delta$ or that $\delta$ is
half as efficient as $\delta'$.

The following result shows that in order to obtain the ARE, it is not necessary
to evaluate the limit $n'(n)/n$.

Theorem 6.7 If $\sqrt{n}[\delta_{in} - g(\theta)] \xrightarrow{d} N(0, \tau_i^2)$, $i = 1, 2$, then the ARE of $\{\delta_{2n}\}$ with
respect to $\{\delta_{1n}\}$ exists and is $e_{2,1} = \tau_1^2/\tau_2^2$.

Proof. Since

$$\sqrt{n}[\delta_{2n'} - g(\theta)] = \sqrt{\frac{n}{n'}}\, \sqrt{n'}[\delta_{2n'} - g(\theta)],$$

it follows from Theorem 1.8.10 that the left side has the same limit distribution
$N(0, \tau_1^2)$ as $\sqrt{n}[\delta_{1n} - g(\theta)]$ if and only if $\lim [n/n'(n)]$ exists and

$$\tau_2^2 \lim \frac{n}{n'(n)} = \tau_1^2,$$

as was to be proved. □


Example 6.8 Continuation of Example 6.5. It follows from (6.19) and (6.20) that
the efficiency of $\hat{\rho}$ to $\hat{\hat{\rho}}$ is

(6.22) $$e_{\hat{\rho}, \hat{\hat{\rho}}} = \frac{1}{1 + \rho^2}.$$

This is 1 when $\rho = 0$ but can be close to 1/2 when $|\rho|$ is close to 1. Similarly,

(6.23) $$e_{\hat{\sigma}^2, \hat{\hat{\sigma}}^2} = \frac{2(1 - \rho^2)}{2 - \rho^2}.$$

This efficiency is again 1 when $\rho = 0$ but tends to zero as $|\rho| \to 1$. This last result,
which at first may seem surprising, actually is easy to explain. If $\rho$ were equal to
1, and $\tau = 1$ say, we would have $X_i = \sigma Y_i$. Since both $X_i$ and $Y_i$ are observed, we
could then determine $\sigma$ without error from a single observation. ||
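
To see how the two efficiencies behave, one can evaluate the ratio of asymptotic variances from Theorem 6.7 directly. The sketch below does this for (6.22) and (6.23), using the variances stated in (6.19), (6.20) and (6.21); the function names and the grid of $\rho$ values are illustrative choices.

```python
def are(var_candidate, var_reference):
    """ARE of the candidate sequence with respect to the reference (Theorem 6.7)."""
    return var_reference / var_candidate


def are_rho(rho):
    """Efficiency (6.22): rho estimated with sigma, tau unknown vs. known."""
    var_scales_unknown = (1 - rho ** 2) ** 2                  # from (6.19)
    var_scales_known = (1 - rho ** 2) ** 2 / (1 + rho ** 2)   # from (6.20)
    return are(var_scales_unknown, var_scales_known)


def are_sigma2(rho, sigma2=1.0):
    """Efficiency (6.23): sigma^2 estimated with rho, tau unknown vs. known."""
    var_all_unknown = 2 * sigma2 ** 2                                    # from (6.19)
    var_rho_tau_known = 4 * sigma2 ** 2 * (1 - rho ** 2) / (2 - rho ** 2)  # from (6.21)
    return are(var_all_unknown, var_rho_tau_known)


for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, round(are_rho(rho), 4), round(are_sigma2(rho), 4))
```

The first efficiency stays between 1/2 and 1 on this grid, while the second falls toward zero as $|\rho|$ approaches 1. The value $\rho = 1$ itself is excluded because both variances vanish there.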


Example 6.9 Efficiency of nonparametric UMVU estimator. As another example
of an efficiency calculation, recall Example 2.2.2. If $X_1, \ldots, X_n$ are iid
according to $N(\theta, 1)$, it was found that the UMVU estimator of

$$p = P(X_1 \le a)$$

is

(6.24) $$\delta_{1n} = \Phi\left[\sqrt{\frac{n}{n - 1}}\,(a - \bar{X})\right].$$

Suppose now that we do not trust the assumption of normality; then, we might,
instead of (6.24), prefer to use the nonparametric UMVU estimator derived in
Section 2.4, namely

(6.25) $$\delta_{2n} = \frac{1}{n}(\text{No. of } X_i \le a).$$

What do we lose by using (6.25) instead of (6.24) if the X's are $N(\theta, 1)$ after
all? Note that $p$ is then given by

(6.26) $$p = \Phi(a - \theta)$$

and that

$$\sqrt{n}(\delta_{1n} - p) \xrightarrow{d} N[0, \phi^2(a - \theta)].$$

On the other hand, $n\delta_{2n}$ is the number of successes in $n$ binomial trials with success
probability $p$, so that

$$\sqrt{n}(\delta_{2n} - p) \xrightarrow{d} N(0, p(1 - p)).$$

It thus follows from Theorem 6.7 that

(6.27) $$e_{2,1} = \frac{\phi^2(a - \theta)}{\Phi(a - \theta)[1 - \Phi(a - \theta)]}.$$

At $a = \theta$ (when $p = 1/2$), $e_{2,1} = (1/2\pi)/(1/4) = 2/\pi \approx 0.637$. As $a - \theta \to \infty$,
the efficiency tends to zero (Problem 6.12). It can be shown, in fact, that (6.27) is
a decreasing function of $|a - \theta|$ (for a proof, see Sampford, 1953). The efficiency
loss resulting from the use of $\delta_{2n}$ instead of $\delta_{1n}$ is therefore quite severe. If the
underlying distribution is not normal, however, this conclusion could change (see
Problem 6.13). ||
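
The efficiency (6.27) depends only on $z = a - \theta$ and is easy to tabulate. The following sketch evaluates it with the standard library's error function; the helper names and the grid of $z$ values are illustrative and not part of the text.

```python
import math


def normal_density(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function Phi(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def efficiency(z):
    """e_{2,1} of (6.27) as a function of z = a - theta."""
    p = normal_cdf(z)
    return normal_density(z) ** 2 / (p * (1 - p))


grid = (0.0, 0.5, 1.0, 2.0, 3.0)
values = [efficiency(z) for z in grid]
for z, e in zip(grid, values):
    print(z, round(e, 4))
```

The first value is $2/\pi$, and the values decrease along the grid, in line with the monotonicity result cited from Sampford (1953).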


Example 6.10 Normal mixtures. Let $X_1, \ldots, X_n$ be iid, each with probability
$p$ as $N(\xi, \sigma^2)$ and probability $q = 1 - p$ as $N(\eta, \tau^2)$. (The Tukey models are
examples of such distributions with $\eta = \xi$.) The joint density of the X's is then
given by

(6.28) $$\prod_{i=1}^{n} \left\{ \frac{p}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x_i - \xi)^2\right] + \frac{q}{\sqrt{2\pi}\,\tau} \exp\left[-\frac{1}{2\tau^2}(x_i - \eta)^2\right] \right\}.$$

This is a sum of non-negative terms of which one, for example, is proportional to

$$\frac{1}{\sigma\tau^{n-1}} \exp\left[-\frac{1}{2\sigma^2}(x_1 - \xi)^2 - \frac{1}{2\tau^2}\sum_{i=2}^{n}(x_i - \eta)^2\right].$$

When $\xi = x_1$ and $\sigma \to 0$, this term tends to infinity for any fixed values of $\eta$, $\tau$, and
$x_2, \ldots, x_n$. The likelihood is therefore unbounded and the MLE does not exist. (The
corresponding result holds for any other mixture with density
$\prod_i \{(p/\sigma) f[(x_i - \xi)/\sigma] + (q/\tau) f[(x_i - \eta)/\tau]\}$ when $f(0) \ne 0$.)
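
The unboundedness can be observed numerically. The sketch below, with made-up data and arbitrary values for the other parameters, evaluates the log of (6.28) with $\xi$ pinned to the first observation while $\sigma$ shrinks; the function names are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x; underflows quietly to 0.0 far in the tails."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def mixture_loglik(xs, p, xi, sigma, eta, tau):
    """Log of the joint density (6.28)."""
    return sum(
        math.log(p * normal_pdf(x, xi, sigma) + (1 - p) * normal_pdf(x, eta, tau))
        for x in xs
    )


xs = [0.3, -1.2, 0.8, 2.1, -0.5]  # hypothetical sample


def profile(sigma):
    """Log likelihood with xi = x_1 and the remaining parameters held fixed."""
    return mixture_loglik(xs, p=0.5, xi=xs[0], sigma=sigma, eta=0.0, tau=1.0)


for sigma in (1e-2, 1e-4, 1e-8):
    print(sigma, profile(sigma))
```

Once $\sigma$ is small, each further reduction of $\sigma$ by a factor $c$ raises the log likelihood by about $\log c$, so no maximum is attained.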

On the other hand, the conditions of Theorem 5.1 are satisfied (Problem 6.10)
so that efficient solutions of the likelihood equations exist and asymptotically
efficient estimators can be obtained through Theorem 5.3. One approach to the
determination of the required $\sqrt{n}$-consistent estimators is the method of moments.
In the present case, this means equating the first five moments of the X's with the
corresponding sample moments and then solving for the five parameters. For the
normal mixture problem, these estimators were proposed in their own right by K.
Pearson (1894). For a discussion and possible simplifications, see Cohen 1967,
and for more details on mixture problems, see Everitt and Hand 1981, Titterington
et al. 1985, McLachlan and Basford 1988, and McLachlan 1997.





A study of the improvement of an asymptotically efficient estimator over that
obtained by the method of moments has been carried out for the case for which it
is known that $\tau = \sigma$ (Tan and Chang 1972). If $\Delta = (\eta - \xi)/\sigma$, the AREs for the
estimation of all four parameters depend only on $\Delta$ and $p$. As an example, consider
the estimation of $p$. Here, the ARE is < 0.01 if $\Delta$ < 1/2 and $p$ < 0.2; it is < 0.1
if $\Delta$ < 1/2 and 0.2 < $p$ < 0.4, and is > 0.9 if $\Delta$ > 0.5. (For an alternative
starting point for the application of Theorem 5.3, see Quandt and Ramsey 1978,
particularly the discussion by N. Kiefer.) ||

Example 6.11 Multinomial experiments. Let $(X_0, X_1, \ldots, X_s)$ have the multinomial
distribution (1.5.4). In the full-rank exponential representation,

$$\exp[n \log p_0 + x_1 \log(p_1/p_0) + \cdots + x_s \log(p_s/p_0)]\, h(x),$$

the statistics $T_j$ can be taken to be the $X_j$. Using the mean-value parameterization,
the likelihood equations (6.10) reduce to $np_j = X_j$ so that the MLE of $p_j$ is
$\hat{p}_j = X_j/n$ ($j = 1, \ldots, s$). If $X_j$ is 0 or $n$, the likelihood equations have no
solution in the parameter space $0 < p_j < 1$, $\sum_{j=1}^{s} p_j < 1$. However, for any
fixed vector $p$, the probability of any $X_j$ taking on either of these values tends
to zero as $n \to \infty$. (But the convergence is not uniform, which causes trouble
for asymptotic confidence intervals; see Lehmann and Loh 1990.) That the MLEs
$\hat{p}_j$ are asymptotically efficient is seen by introducing the indicator variables $X_{j\nu}$,
$\nu = 1, \ldots, n$, which are 1 when the $\nu$th trial results in outcome $j$ and are 0
otherwise. Then, the vectors $(X_{0\nu}, \ldots, X_{s\nu})$ are iid and $T_j = X_{j1} + \cdots + X_{jn}$, so that
asymptotic efficiency follows from Example 6.3. ||

In applications of the multinomial distribution to contingency tables, the $p$'s are
usually subject to additional restrictions. Theorem 5.1 typically continues to apply,
although the computation of the estimators tends to be less obvious. This class of
problems is treated comprehensively in Haberman (1973, 1974), Bishop, Fienberg,
and Holland (1975), and Agresti (1990). Empty cells often present special
problems.


7 Extensions 

The discussion of efficient likelihood estimation so far has been restricted to the iid
case. In the present section, we briefly mention extensions to some more general
situations, which permit results analogous to those of Sections 6.3-6.5. Treatments
not requiring the stringent (but frequently applicable) assumptions of Theorems 3.10
and 5.1 have been developed by Le Cam 1953, 1969, 1970, 1986, and others. For
further work in this direction, see Pfanzagl 1970, 1994, Weiss and Wolfowitz 1974,
Ibragimov and Has'minskii 1981, Blyth 1982, Strasser 1985, and Wong (1992).

The theory easily extends to the case of two or more samples. Suppose that the
variables $X_{\alpha 1}, \ldots, X_{\alpha n_\alpha}$ in the $\alpha$th sample are iid according to the distribution with
density $f_{\alpha,\theta}$ ($\alpha = 1, \ldots, r$) and that the $r$ samples are independent. In applications,
it will typically turn out that the vector parameter $\theta = (\theta_1, \ldots, \theta_s)$ has some
components occurring in more than one of the $r$ distributions, whereas others may
be specific to just one distribution. However, for the present discussion, we shall
permit each of the distributions to depend on all the parameters.

The limit situation we shall consider supposes that each of the sample sizes $n_\alpha$
tends to infinity, all at the same rate, but that $r$ remains fixed. Consider, therefore,
sequences of sample sizes $n_{\alpha,k}$ ($k = 1, \ldots, \infty$) with total sample size
$N_k = \sum_{\alpha=1}^{r} n_{\alpha,k}$ such that

(7.1) $$n_{\alpha,k}/N_k \to \lambda_\alpha \quad \text{as } k \to \infty,$$

where $\sum \lambda_\alpha = 1$ and the $\lambda_\alpha$ are > 0.

Theorem 7.1 Suppose the assumptions of Theorem 5.1 hold for each of the densities
$f_{\alpha,\theta}$. Let $I^{(\alpha)}(\theta)$ denote the information matrix corresponding to $f_{\alpha,\theta}$ and
let

(7.2) $$I(\theta) = \sum_{\alpha} \lambda_\alpha I^{(\alpha)}(\theta).$$

The log likelihood $l(\theta)$ is given by

$$l(\theta) = \sum_{\alpha=1}^{r} \sum_{j=1}^{n_\alpha} \log f_{\alpha,\theta}(x_{\alpha j})$$

and the likelihood equations by

(7.3) $$\frac{\partial}{\partial \theta_j} l(\theta) = 0 \quad (j = 1, \ldots, s).$$

With these identifications, the conclusions of Theorem 5.1 remain valid.

The proof is an easy extension of that of Theorem 5.1 since $l(\theta)$, and therefore
each term of its Taylor expansion, is a sum of $r$ independent terms of the kind
considered in the proof of Theorem 5.1 (Problem 7.1). (For further discussion of
this situation, see Bradley and Gart 1962.)

That asymptotic efficiency continues to have the meaning it had in Theorems
3.10 and 5.1 follows from the fact that Theorem 2.6 and its extension to the
multiparameter case also extend to the present situation (see Bahadur 1964, Section 4).

Corollary 7.2 Under the assumptions of Theorem 7.1, suppose that for each $\alpha$,
all off-diagonal elements in the $j$th row and $j$th column of $I^{(\alpha)}(\theta)$ are zero. Then,
the asymptotic variance of $\hat{\theta}_j$ is the same when the remaining $\theta$'s are unknown as
when they are known.

Proof. If the property in question holds for each $I^{(\alpha)}(\theta)$, it also holds for $I(\theta)$ and
the result thus follows from Problem 6.3. □

The following four examples illustrate some applications of Theorem 7.1. 

Example 7.3 Estimation of a common mean. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$
be independently distributed according to $N(\xi, \sigma^2)$ and $N(\xi, \tau^2)$, respectively,
with $\xi$, $\sigma$, and $\tau$ unknown. The problem of estimating $\xi$ was considered briefly in
Example 2.2.3 where it was found that a UMVU estimator for $\xi$ does not exist.
Complications also arise in the problem of asymptotically efficient estimation of
$\xi$.

Since the MLEs of the mean and variance of a single normal distribution are
asymptotically independent, Corollary 7.2 applies and shows that $\xi$ can be estimated
with the efficiency that is attainable when $\sigma$ and $\tau$ are known. Now, in that
case, the MLE (which is also UMVU) is

$$\hat{\xi} = \frac{(m/\sigma^2)\bar{X} + (n/\tau^2)\bar{Y}}{m/\sigma^2 + n/\tau^2}.$$

It is now tempting to claim that Theorem 1.8.10 implies that the asymptotic distribution
of $\hat{\xi}$ is not changed when $\sigma^2$ and $\tau^2$ are replaced by

(7.4) $$\hat{\sigma}^2 = \frac{1}{m - 1}\sum (X_i - \bar{X})^2 \quad \text{and} \quad \hat{\tau}^2 = \frac{1}{n - 1}\sum (Y_j - \bar{Y})^2$$

and the resulting estimator, say $\hat{\hat{\xi}}$, is asymptotically normal and efficient. However,
this does not immediately follow. To see why, let us look at the simple case where
$m = n$ and, hence, $\mathrm{var}(\hat{\xi}) = (\sigma^2 + \tau^2)/n$. Consider the asymptotic distribution of

(7.5) $$\sqrt{n}\,\frac{\hat{\hat{\xi}} - \xi}{\sqrt{\sigma^2 + \tau^2}} = \sqrt{n}\,\frac{\hat{\hat{\xi}} - \hat{\xi}}{\sqrt{\sigma^2 + \tau^2}} + \sqrt{n}\,\frac{\hat{\xi} - \xi}{\sqrt{\sigma^2 + \tau^2}}.$$

Since $\hat{\xi}$ is efficient, efficiency of $\hat{\hat{\xi}}$ will follow if $\sqrt{n}(\hat{\hat{\xi}} - \hat{\xi}) \to 0$, which is not
the case. But Theorem 7.1 does apply, and an asymptotically efficient estimator is
given by the full MLE (see Problem 7.2). ||
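
For concreteness, the two estimators discussed in this example can be computed as follows. This is only a sketch with made-up data: the first function is the weighted mean that is the MLE when $\sigma^2$ and $\tau^2$ are known, and the second substitutes the sample variances of (7.4). The function names are illustrative, and the code makes no claim about the efficiency of either estimator.

```python
def common_mean_known_variances(xs, ys, sigma2, tau2):
    """Precision-weighted mean of the two sample means, for known variances."""
    m, n = len(xs), len(ys)
    x_bar, y_bar = sum(xs) / m, sum(ys) / n
    weight_x, weight_y = m / sigma2, n / tau2
    return (weight_x * x_bar + weight_y * y_bar) / (weight_x + weight_y)


def sample_variance(values):
    """Unbiased sample variance with divisor k - 1, as in (7.4)."""
    k = len(values)
    mean = sum(values) / k
    return sum((v - mean) ** 2 for v in values) / (k - 1)


def common_mean_plug_in(xs, ys):
    """Same weighted mean with the variances replaced by their estimates (7.4)."""
    return common_mean_known_variances(xs, ys, sample_variance(xs), sample_variance(ys))


xs = [1.0, 2.0, 3.0]        # hypothetical first sample
ys = [2.0, 2.5, 3.0, 2.5]   # hypothetical second sample
known = common_mean_known_variances(xs, ys, sigma2=1.0, tau2=1.0)
plug_in = common_mean_plug_in(xs, ys)
print(known, plug_in)
```

With equal known variances the estimator reduces to the pooled mean of all $m + n$ observations; the plug-in version shifts weight toward the sample with the smaller estimated variance.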


Example 7.4 Balanced one-way random effects model. Consider the estimation
of variance components $\sigma_A^2$ and $\sigma^2$ in model (3.5.1). In the canonical form (3.5.2),
we are dealing with independent normal variables $Z_{11}$ and $Z_{i1}$ ($i = 2, \ldots, s$), and
$Z_{ij}$ ($i = 1, \ldots, s$, $j = 2, \ldots, n$). We shall restrict attention to the second and third
group, as suggested by Thompson (1962), and we are then dealing with samples
of sizes $s - 1$ and $(n - 1)s$ from $N(0, \tau^2)$ and $N(0, \sigma^2)$, where $\tau^2 = \sigma^2 + n\sigma_A^2$.
The assumptions of Theorem 7.1 are satisfied with $r = 2$, $\theta = (\sigma^2, \tau^2)$, and the
parameter space $\Omega = \{(\sigma, \tau) : 0 < \sigma^2 < \tau^2\}$. For fixed $n$, the sample sizes
$n_1 = s - 1$ and $n_2 = s(n - 1)$ tend to infinity as $s \to \infty$, with $\lambda_1 = 1/n$ and
$\lambda_2 = (n - 1)/n$.

The joint density of the second and third group of Z's constitutes a two-parameter
exponential family; the log likelihood $l(\theta)$ satisfies

(7.6) $$-l(\theta) = n_2 \log \sigma + n_1 \log \tau + \frac{S^2}{2\sigma^2} + \frac{S_A^2}{2\tau^2} + c,$$

where $S^2 = \sum_{i=1}^{s} \sum_{j=2}^{n} Z_{ij}^2$ and $S_A^2 = \sum_{i=2}^{s} Z_{i1}^2$. By Example 7.8, the likelihood
equations have at most one solution. Solving the equations yields

(7.7) $$\hat{\sigma}^2 = S^2/n_2, \quad \hat{\tau}^2 = S_A^2/n_1,$$

and these are the desired (unique, ML) solution, provided they are in $\Omega$, that is,
they satisfy

(7.8) $$\hat{\sigma}^2 < \hat{\tau}^2.$$

It follows from Theorem 5.1 that the probability of (7.8) tends to 1 as $s \to \infty$ for
any $\theta \in \Omega$; this can also be seen directly from the fact that $\hat{\sigma}^2$ and $\hat{\tau}^2$ tend to $\sigma^2$
and $\tau^2$ in probability.

What can be said when (7.8) is violated? The likelihood equations then have no
root in $\Omega$ and an MLE does not exist (the likelihood attains its maximum at the
boundary point $\hat{\sigma}^2 = \hat{\tau}^2 = (S_A^2 + S^2)/(n_1 + n_2)$, which is not in $\Omega$). However, none of
this matters from the present point of view since the asymptotic theory has nothing
to say about a set of values whose probability tends to zero. (For small-sample
computations of the mean squared error of a number of estimators of $\sigma^2$ and $\sigma_A^2$,
see Klotz, Milton and Zacks 1969, Portnoy 1971, and Searle et al. 1992.)

The joint asymptotic distribution of $\hat{\sigma}^2$ and $\hat{\tau}^2$ can be obtained from Theorem
6.7 or directly from the distribution of $S_A^2$ and $S^2$ and the CLT, and a linear
transformation of the limit distribution then gives the joint asymptotic distribution of
$\hat{\sigma}^2$ and $\hat{\sigma}_A^2$ (Problem 7.3). ||
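
The case distinction in this example can be written out as a short routine. The sketch below assumes the canonical variables are already available as two lists (the second group $Z_{i1}$, $i \ge 2$, and the third group $Z_{ij}$, $j \ge 2$); the function name, the returned flag, and the toy inputs are illustrative choices.

```python
def variance_component_roots(z_second_group, z_third_group):
    """Solve the likelihood equations (7.7) and check the constraint (7.8).

    z_second_group: the n_1 = s - 1 values Z_i1, each with variance tau^2.
    z_third_group:  the n_2 = s(n - 1) values Z_ij, each with variance sigma^2.
    Returns (sigma2_hat, tau2_hat, in_parameter_space). When (7.8) fails, the
    likelihood equations have no root in the parameter space, and the boundary
    maximizer (S_A^2 + S^2)/(n_1 + n_2) is returned for both components.
    """
    n1, n2 = len(z_second_group), len(z_third_group)
    s2_a = sum(z * z for z in z_second_group)
    s2 = sum(z * z for z in z_third_group)
    sigma2_hat, tau2_hat = s2 / n2, s2_a / n1
    if sigma2_hat < tau2_hat:
        return sigma2_hat, tau2_hat, True
    pooled = (s2_a + s2) / (n1 + n2)
    return pooled, pooled, False


# Hypothetical canonical data for which (7.8) holds, then for which it fails.
print(variance_component_roots([2.0, -2.0], [1.0, -1.0, 1.0, -1.0]))
print(variance_component_roots([0.5, -0.5], [1.0, -1.0, 1.0, -1.0]))
```

When the flag is true, an estimate of $\sigma_A^2$ follows from $\tau^2 = \sigma^2 + n\sigma_A^2$ as `(tau2_hat - sigma2_hat) / n`.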

Example 7.5 Balanced two-way random effects model. A new issue arises as we
go from the one-way to the two-way layout with the model given by (3.5.5). After
elimination of $Z_{111}$ (in the notation of Example 5.2), the data in canonical form
consist of four samples $Z_{i11}$ ($i = 2, \ldots, I$), $Z_{1j1}$ ($j = 2, \ldots, J$), $Z_{ij1}$ ($i = 2, \ldots, I$,
$j = 2, \ldots, J$), and $Z_{ijk}$ ($i = 1, \ldots, I$, $j = 1, \ldots, J$, $k = 2, \ldots, n$), and the
parameter is $\theta = (\sigma, \tau_A, \tau_B, \tau_C)$ where

(7.9) $$\tau_C^2 = \sigma^2 + n\sigma_C^2, \quad \tau_B^2 = nI\sigma_B^2 + n\sigma_C^2 + \sigma^2, \quad \tau_A^2 = nJ\sigma_A^2 + n\sigma_C^2 + \sigma^2$$

so that $\Omega = \{\theta : \sigma^2 < \tau_C^2 < \tau_A^2, \tau_B^2\}$. The joint density of these variables constitutes
a four-parameter exponential family. The likelihood equations thus again have at
most one root, and this is given by

$$\hat{\sigma}^2 = S^2/(n - 1)IJ, \quad \hat{\tau}_C^2 = S_C^2/(I - 1)(J - 1), \quad \hat{\tau}_B^2 = S_B^2/(J - 1), \quad \hat{\tau}_A^2 = S_A^2/(I - 1)$$

when $\hat{\sigma}^2 < \hat{\tau}_C^2 < \hat{\tau}_A^2, \hat{\tau}_B^2$. No root exists when these inequalities fail.

In this case, asymptotic theory requires that both $I$ and $J$ tend to infinity, and
assumption (7.1) of Theorem 7.1 then does not hold. Asymptotic efficiency of the
MLEs follows, however, from Theorem 5.1 since each of the samples depends
on only one of the parameters $\sigma^2$, $\tau_A^2$, $\tau_B^2$, and $\tau_C^2$. The apparent linkage of these
parameters through the inequalities $\sigma^2 < \tau_C^2 < \tau_A^2, \tau_B^2$ is immaterial. The true point
$\theta^0 = (\sigma^0, \tau_A^0, \tau_B^0, \tau_C^0)$ is assumed to satisfy these restrictions, and each parameter
can then independently vary about the true value, which is all that is needed for
Theorem 5.1. It, therefore, follows as in the preceding example that the MLEs are
asymptotically efficient, and that $\sqrt{(n - 1)IJ}\,(\hat{\sigma}^2 - \sigma^2)$, and so on, have the limit
distributions given by Theorem 5.1 or are directly obtainable from the definition
of these estimators. ||

A general large-sample treatment both of components of variance and the more
general case of mixed models, without assuming the models to be balanced, was
given by Miller (1977); see also Searle et al. 1992, Cressie and Lahiri 1993, and
Jiang 1996, 1997.

Example 7.6 Independent binomial experiments. As in Section 3.5, let $X_i$
($i = 1, \ldots, s$) be independently distributed according to the binomial distributions
$b(p_i, n_i)$, with the $p_i$ being functions of a smaller number of parameters.
If the $n_i$ tend to infinity at the same rate, the situation is of the type considered
in Theorem 7.1, which will, in typical cases, ensure the existence of an efficient
solution of the likelihood equations with probability tending to 1.

As an illustration, suppose, as in (3.6.12), that the $p$'s are given in terms of the
logistic distribution, and more specifically that

(7.10) $$p_i = \frac{e^{-(\alpha + \beta t_i)}}{1 + e^{-(\alpha + \beta t_i)}}$$

where the $t$'s are known numbers and $\alpha$ and $\beta$ are the parameters to be estimated.
The likelihood equations

(7.11) $$\sum n_i p_i = \sum x_i, \quad \sum n_i t_i p_i = \sum t_i x_i$$

have at most one solution (Problem 7.6) which will exist with probability tending
to 1 (but may not exist for some particular finite values) and which can be obtained
by standard iterative methods.
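
One such iterative method is Newton-Raphson applied to the two equations (7.11). The sketch below is a minimal version under stated assumptions: the data are made up, the starting point $\alpha = \beta = 0$ is an arbitrary choice, and the function names are illustrative. It is not presented as the method of any particular reference.

```python
import math


def cell_probability(alpha, beta, t):
    """p_i of (7.10), written equivalently as 1 / (1 + exp(alpha + beta t))."""
    return 1.0 / (1.0 + math.exp(alpha + beta * t))


def solve_likelihood_equations(ts, ns, xs, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration for (7.11), started at alpha = beta = 0."""
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        ps = [cell_probability(alpha, beta, t) for t in ts]
        # Left side minus right side of the two equations in (7.11);
        # these are also the derivatives of the log likelihood.
        g_a = sum(n * p - x for n, p, x in zip(ns, ps, xs))
        g_b = sum(t * (n * p - x) for t, n, p, x in zip(ts, ns, ps, xs))
        # Negative Hessian of the log likelihood: [[a, b], [b, c]].
        ws = [n * p * (1.0 - p) for n, p in zip(ns, ps)]
        a = sum(ws)
        b = sum(t * w for t, w in zip(ts, ws))
        c = sum(t * t * w for t, w in zip(ts, ws))
        det = a * c - b * b
        step_a = (c * g_a - b * g_b) / det
        step_b = (a * g_b - b * g_a) / det
        alpha, beta = alpha + step_a, beta + step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return alpha, beta


# Hypothetical data: five dose levels t_i, 20 trials each, x_i successes.
ts = [-2.0, -1.0, 0.0, 1.0, 2.0]
ns = [20, 20, 20, 20, 20]
xs = [16, 13, 10, 6, 3]
alpha_hat, beta_hat = solve_likelihood_equations(ts, ns, xs)
fitted = [n * cell_probability(alpha_hat, beta_hat, t) for n, t in zip(ns, ts)]
print(alpha_hat, beta_hat)
print(sum(fitted), sum(xs))  # the first equation of (7.11) at the root
```

If some $x_i$ pattern makes the likelihood equations unsolvable (for example, all successes at low $t$ and none at high $t$), the iteration will not settle, which matches the remark that a solution may fail to exist for particular finite samples.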

That the likelihood equations have at most one solution is true not only for the
model (7.10) but more generally when

(7.12) $$p_i = 1 - F\left(\sum_j \beta_j t_{ij}\right)$$

where the $t$'s are known, the $\beta$'s are being estimated, and $F$ is a known distribution
function with $\log F(x)$ and $\log[1 - F(x)]$ strictly concave. (See Haberman 1974,
Chapter 8; and Problem 7.7.) For further discussion of this and more general
logistic regression models, see Pregibon 1981 or Searle et al. 1992, Chapter 10. ||


For the multinomial problem mentioned in the preceding section and those of
Example 7.6, alternative methods have been developed which are asymptotically
equivalent to the ELEs, and hence also asymptotically efficient. These methods
are based on minimizing $\chi^2$ or some other functions measuring the distance of the
vector of probabilities from that of the observed frequencies. (See, for example,
Neyman 1949, Taylor 1953, Le Cam 1956, 1990, Wijsman 1959, Berkson 1980,
Amemiya 1980, and Ghosh and Sinha 1981 or Agresti 1990 for entries to the
literature on choosing between these different estimators.)

The situation of Theorem 7.1 shares with that of Theorem 3.10 the crucial property
that the total amount of information $T(\theta)$ asymptotically becomes arbitrarily
large. In the general case of independent but not identically distributed variables,
this need no longer be the case.

Example 7.7 Total information. Let $X_i$ ($i = 1, \ldots, n$) be independent Poisson
variables with $E(X_i) = \gamma_i \lambda$ where the $\gamma$'s are known numbers. Consider two cases.

(a) $\sum_{i=1}^{\infty} \gamma_i < \infty$. The amount of information $X_i$ contains about $\lambda$ is $\gamma_i/\lambda$
by (2.5.11) and Table 2.5.1, and the total amount of information $T_n(\lambda)$ that
$(X_1, \ldots, X_n)$ contains about $\lambda$ is therefore

(7.13) $$T_n(\lambda) = \frac{1}{\lambda} \sum_{i=1}^{n} \gamma_i.$$

It is intuitively plausible that in these circumstances $\lambda$ cannot be estimated consistently
because only the early observations provide an appreciable amount
of information. To prove this formally, note that $Y_n = \sum_{i=1}^{n} X_i$ is a sufficient
statistic for $\lambda$ on the basis of $(X_1, \ldots, X_n)$ and that $Y_n$ has Poisson distribution
with mean $\lambda \sum_{i=1}^{n} \gamma_i$. Thus, all the $Y_n$'s are less informative than a random
variable $Y$ with distribution $P(\lambda \sum_{i=1}^{\infty} \gamma_i)$ in the sense that the distribution of
any estimator based on $Y_n$ can be duplicated by one based on $Y$ (Problem
7.9). Since $\lambda$ cannot be estimated exactly on the basis of $Y$, the result follows.

(b) $\sum_{i=1}^{\infty} \gamma_i = \infty$. Here, the MLE $\delta_n = \sum_{i=1}^{n} X_i / \sum_{i=1}^{n} \gamma_i$ is consistent and
asymptotically normal (Problem 7.10) with

(7.14) $$\left(\sum_{i=1}^{n} \gamma_i\right)^{1/2} (\delta_n - \lambda) \xrightarrow{d} N(0, \lambda).$$

Thus, $\delta_n$ is approximately distributed as $N[\lambda, 1/T_n(\lambda)]$ and an extension of Theorem
2.6 to the present case (see Bahadur 1964) permits the conclusion that $\delta_n$ is
asymptotically efficient.

Note: The norming constant required for asymptotic normality must be proportional
to $\sqrt{\sum_{i=1}^{n} \gamma_i}$. Depending on the nature of the $\gamma$'s, this can be any function
of $n$ tending to infinity rather than the customary $\sqrt{n}$. In general, it is the total
amount of information rather than the sample size which governs the asymptotic
distribution of an asymptotically efficient estimator. In the iid case, $T_n(\theta) = nI(\theta)$,
so that $\sqrt{T_n(\theta)}$ is proportional to $\sqrt{n}$. ||
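
The contrast between the two cases shows up directly in the total information $T_n(\lambda) = \sum_{i=1}^{n} \gamma_i / \lambda$. The sketch below tabulates it for one summable and one non-summable choice of the $\gamma$'s; both sequences, the value of $\lambda$, and the function names are illustrative assumptions.

```python
def total_information(gammas, lam):
    """T_n(lambda): sum of the per-observation information gamma_i / lambda."""
    return sum(gammas) / lam


def poisson_mle(xs, gammas):
    """MLE of lambda when E(X_i) = gamma_i * lambda."""
    return sum(xs) / sum(gammas)


lam = 2.0
for n in (10, 100, 1000):
    summable = [2.0 ** -i for i in range(1, n + 1)]  # case (a): the sum tends to 1
    constant = [1.0] * n                              # case (b): the sum equals n
    print(n, total_information(summable, lam), total_information(constant, lam))

# Hypothetical counts with unequal exposures gamma_i.
print(poisson_mle([3, 1, 2], [1.0, 0.5, 1.5]))
```

In case (a) the information never exceeds $1/\lambda$ however large $n$ becomes, whereas in case (b) it grows linearly in $n$.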

A general treatment of the case of independent random variables with densities
$f_j(x_j|\theta)$, $\theta = (\theta_1, \ldots, \theta_r)$, along the lines of Theorems 3.10 and 5.1 has been given
by Bradley and Gart (1962) and Hoadley (1971) (see also Nordberg 1980). The
proof (for $r = 1$) is based on generalizations of (3.18)-(3.20) (see Problem 7.14)
and hence depends on a suitable law of large numbers and central limit theorem for
sums of independent nonidentical random variables. In the multiparameter case,
of course, it may happen that some of the parameters can be consistently estimated
and others not.

The theory for iid variables summarized by Theorems 2.6, 3.10, and 5.1 can be
generalized not only to the case of independent nonidentical variables but also to
dependent variables whose joint distribution depends on a fixed number of parameters
$\theta = (\theta_1, \ldots, \theta_r)$ where, for illustration, we take $r = 1$. (The generalization
to $r > 1$ is straightforward.) The log likelihood $l(\theta)$ is now the sum of the logarithms
of the conditional densities $f_j(x_j|\theta, x_1, \ldots, x_{j-1})$ and the total amount of
information $T_n(\theta)$ is the sum of the expected conditional amounts of information
$I_j(\theta)$ in $X_j$, given $X_1, \ldots, X_{j-1}$:

$$I_j(\theta) = E\left\{ E\left[ \left( \frac{\partial}{\partial\theta} \log f_j(X_j|\theta, X_1, \ldots, X_{j-1}) \right)^2 \,\Big|\, X_1, \ldots, X_{j-1} \right] \right\}.$$

Under regularity conditions on the $f_j$'s, analogous to those of Theorems 3.10
and 5.1 together with additional conditions to ensure that the total amount of
information tends to infinity as $n \to \infty$ and that the appropriate CLT for dependent
variables is applicable, it can be shown that with probability tending to 1, there
exists a root $\hat{\theta}_n$ of the likelihood equations such that
$\sqrt{T_n(\theta)}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, 1)$.
This program has been carried out in a series of papers by Bar-Shalom (1971),
Bhat (1974), and Crowder (1976).^5 [The required extension of Theorem 2.6 can be
obtained from Bahadur (1964); see also Kabaila 1983.] The following illustrates
the theory with a simple classic example.

Example 7.8 Normal autoregressive Markov series. Let

(7.15) $$X_j = \beta X_{j-1} + U_j, \quad j = 2, 3, \ldots,$$

where the $U_j$ are iid as $N(0, 1)$, where $\beta$ is an unknown parameter satisfying
$|\beta| < 1$,^6 and where $X_1$ is $N(0, \sigma^2)$. The X's all have marginal normal distributions
with mean zero. The variance of $X_j$ satisfies

(7.16) $$\mathrm{var}(X_j) = \beta^2\, \mathrm{var}(X_{j-1}) + 1$$

and hence $\mathrm{var}(X_j) = \sigma^2$ for all $j$ provided

(7.17) $$\sigma^2 = 1/(1 - \beta^2).$$

This is the stationary case in which $(X_{j_1}, \ldots, X_{j_k})$ has the same distribution as
$(X_{j_1 + r}, \ldots, X_{j_k + r})$ for all $r = 1, 2, \ldots$ (Problem 7.15).

The amount of information that each $X_j$ ($j > 1$) contains about $\beta$ is
$I_j(\beta) = 1/(1 - \beta^2)$, so that $T_n(\beta) \sim n/(1 - \beta^2)$. The general theory therefore
suggests the existence of a root $\hat{\beta}_n$ of the likelihood equation such that

(7.18) $$\sqrt{n}(\hat{\beta}_n - \beta) \xrightarrow{d} N(0, 1 - \beta^2).$$

That (7.18) does hold can also be checked directly (see, for example, Brockwell
and Davis 1987, Section 8.8). ||
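
A small Monte Carlo experiment gives an informal check of (7.18). The sketch below simulates the stationary series (7.15) and uses the root of the likelihood equation conditional on $X_1$ (the least squares ratio), which is a simplifying assumption: the full likelihood also includes the marginal density of $X_1$. The seed, the series length, the number of replications, and the function names are arbitrary illustrative choices.

```python
import math
import random


def simulate_ar1(beta, n, rng):
    """Stationary series (7.15): X_1 ~ N(0, 1/(1 - beta^2)), N(0, 1) innovations."""
    x = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - beta ** 2))]
    for _ in range(n - 1):
        x.append(beta * x[-1] + rng.gauss(0.0, 1.0))
    return x


def conditional_mle(x):
    """Root of the likelihood equation for beta, conditional on X_1."""
    numerator = sum(x[j] * x[j - 1] for j in range(1, len(x)))
    denominator = sum(x[j - 1] ** 2 for j in range(1, len(x)))
    return numerator / denominator


rng = random.Random(12345)
beta, n, reps = 0.5, 400, 300
estimates = [conditional_mle(simulate_ar1(beta, n, rng)) for _ in range(reps)]
mean_estimate = sum(estimates) / reps
scaled_variance = n * sum((b - mean_estimate) ** 2 for b in estimates) / (reps - 1)
# Compare n * var(beta_hat) with the limiting variance 1 - beta^2 of (7.18).
print(mean_estimate, scaled_variance, 1 - beta ** 2)
```

With these settings the scaled variance should land near $1 - \beta^2 = 0.75$, up to Monte Carlo error of roughly ten percent.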

The conclusions of this section up to this point can be summarized by saying 
that the asymptotic theory developed for the iid case in Sections 6.2-6.6 continues 
to hold—under appropriate safeguards—even if the iid assumption is dropped, 
provided the number of parameters is fixed and the total amount of information 
goes to infinity. 

We shall now briefly consider two generalizations of the earlier situation to 
which this conclusion does not apply. The first concerns the case in which the 
number of parameters tends to infinity with the total sample size. 

In Theorem 7.1, the number $r$ of samples was considered fixed, whereas the
sample sizes $n_\alpha$ were assumed to tend to infinity. Such a model is appropriate when
one is dealing with a small number of moderately large samples. A quite different
asymptotic situation arises in the reverse case of a large number (considered as
tending to infinity) of finite samples. Here, an important distinction arises between
structural parameters such as $\xi$ in Example 7.3, which are common to all the
samples and which are the parameters of interest, and incidental parameters such
as $\sigma^2$ and $\tau^2$ in Example 7.3, which occur in only one of the samples. That Theorem
5.1 does not extend to this case is illustrated by the following two examples.

5 A review of the literature of maximum likelihood estimation in both discrete and continuous
parameter stochastic processes can be found in Basawa and Prakasa Rao (1980).

6 For a discussion without this restriction, see Anderson (1959) and Heyde and Feigin (1975).


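Example 7.9 Estimation of a common variance. Let $X_{\alpha j}$ ($j = 1, \ldots, r$) be
independently distributed according to $N(\theta_\alpha, \sigma^2)$, $\alpha = 1, \ldots, n$. The MLEs are

(7.19) $$\hat{\theta}_\alpha = X_{\alpha\cdot}, \quad \hat{\sigma}^2 = \frac{1}{rn} \sum_{\alpha} \sum_{j} (X_{\alpha j} - X_{\alpha\cdot})^2.$$

Furthermore, these are the unique solutions of the likelihood equations.

However, in the present case, the MLE of $\sigma^2$ is not even consistent. To see this,
note that the statistics

$$S_\alpha^2 = \sum_{j} (X_{\alpha j} - X_{\alpha\cdot})^2$$

are identically independently distributed with expectation

$$E(S_\alpha^2) = (r - 1)\sigma^2,$$

so that $\sum S_\alpha^2/n \to (r - 1)\sigma^2$ and hence

(7.20) $$\hat{\sigma}^2 \to \frac{r - 1}{r}\, \sigma^2 \quad \text{in probability}.$$

A consistent and efficient estimator sequence of $\sigma^2$ is available in the present case,
namely

$$\hat{\hat{\sigma}}^2 = \frac{1}{(r - 1)n} \sum S_\alpha^2.$$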
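
The inconsistency is easy to reproduce by simulation. The sketch below draws $n$ groups of size $r = 2$ with arbitrary group means and compares the MLE of (7.19) with the estimator that divides by $(r - 1)n$; the seed, the range of the group means, and the function name are illustrative assumptions.

```python
import random


def common_variance_estimates(samples):
    """Return (MLE with divisor r*n, estimator with divisor (r - 1)*n)."""
    n, r = len(samples), len(samples[0])
    total = 0.0
    for group in samples:
        group_mean = sum(group) / r
        total += sum((x - group_mean) ** 2 for x in group)  # S_alpha^2
    return total / (r * n), total / ((r - 1) * n)


rng = random.Random(0)
sigma, r, n = 2.0, 2, 20000
samples = []
for _ in range(n):
    theta = rng.uniform(-5.0, 5.0)  # incidental parameter of this group
    samples.append([rng.gauss(theta, sigma) for _ in range(r)])

mle, corrected = common_variance_estimates(samples)
# The MLE settles near sigma^2 (r - 1)/r = 2, the second estimator near sigma^2 = 4.
print(mle, corrected)
```

The ratio of the two estimates is exactly $(r - 1)/r$ for every sample, so increasing $n$ cannot remove the bias of the MLE while $r$ stays fixed.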


The study of this class of problems (including Example 7.9) was initiated by 
Neyman and Scott (1948), who also considered a number of other examples in¬ 
cluding one in which an MLE is consistent but not efficient. 

A reformulation of the problem of structural parameters was proposed by Kiefer 
and Wolfowitz (1956), who considered the case in which the incidental parameters 
are themselves random variables, identically independently distributed according 
to some distribution, but, of course, unobservable. This will often bring the situation 
into the area of applicability of Theorems 5.1 or 7.1. 

Example 7.10 Regression with both variables subject to error. Let $X_i$ and $Y_i$
($i = 1, \ldots, n$) be independent normal with means $E(X_i) = \xi_i$ and
$E(Y_i) = \eta_i$ and variances $\sigma^2$ and $\tau^2$, where $\eta_i = \alpha + \beta\xi_i$.
There is, thus, a linear relationship between $\xi$ and $\eta$, both of which are
observed with independent, normally distributed errors. We are interested in
estimating $\beta$ and, for the sake of simplicity, shall take $\alpha$ as known to be
zero. Then, $\theta = (\beta, \sigma^2, \tau^2, \xi_1, \ldots, \xi_n)$, with the first three
parameters being structural and the $\xi$'s incidental. The likelihood is proportional to

(7.21) $\frac{1}{\sigma^n \tau^n} \exp\left[ -\frac{1}{2\sigma^2} \sum (x_i - \xi_i)^2 - \frac{1}{2\tau^2} \sum (y_i - \beta\xi_i)^2 \right]$.

The likelihood equations have two roots, given by (Problem 7.20),

(7.22) $\hat\beta = \pm \sqrt{\frac{\sum y_j^2}{\sum x_j^2}}$, $\quad 2n\hat\sigma^2 = \sum x_j^2 - \frac{1}{\hat\beta} \sum x_j y_j$, $\quad 2n\hat\tau^2 = \sum y_j^2 - \hat\beta \sum x_j y_j$,

$2\hat\xi_i = x_i + \frac{1}{\hat\beta} y_i$, $\quad i = 1, \ldots, n$,

and the likelihood is larger at the root for which $\hat\beta \sum x_j y_j > 0$. If Theorem 5.1
applies, one of these roots must be consistent and, hence, tend to $\beta$ in probability.
Since $S_X^2 = \sum X_j^2$ and $S_Y^2 = \sum Y_j^2$ are independently distributed according to
noncentral $\chi^2$-distributions with noncentrality parameters $\lambda_n^2 = \sum_{j=1}^n \xi_j^2$
and $\beta^2\lambda_n^2$, their limit behavior depends on that of $\lambda_n$. (Note, incidentally,
that for $\lambda_n = 0$, the parameter $\beta$ becomes unidentifiable.) Suppose that
$\lambda_n^2/n \to \lambda^2 > 0$. The distribution of $S_X^2$ and $S_Y^2$ is unchanged if we replace
each $\xi_i^2$ by $\lambda_n^2/n$, and by the law of large numbers, $\sum X_j^2/n$ therefore has the
same limit as

$E(X_1^2) = \sigma^2 + \frac{1}{n}\lambda_n^2 \to \sigma^2 + \lambda^2$.

Similarly, $\sum Y_j^2/n$ tends in probability to $\tau^2 + \beta^2\lambda^2$ and, hence,
$\hat\beta^2 \xrightarrow{P} (\tau^2 + \beta^2\lambda^2)/(\sigma^2 + \lambda^2)$. Thus, neither of the roots
is consistent. [It was pointed out by
Solari (1969) that the likelihood in this problem is unbounded so that an MLE
does not exist (Problem 7.21). The solutions (7.22) are, in fact, saddlepoints of the
likelihood surface.]
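
The inconsistency can be seen numerically. The following is a minimal sketch, not part of the original text: it assumes all incidental parameters are equal to a common value `lam` (so that $\lambda_n^2/n = \lambda^2$ exactly), and the function names and parameter values are illustrative choices.

```python
import math
import random


def beta_root(xs, ys):
    """Root of (7.22) whose sign makes beta_hat * sum(x_j * y_j) positive."""
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return math.copysign(math.sqrt(syy / sxx), sxy)


def root_limit(beta, sigma, tau, lam):
    """Probability limit of |beta_hat| as derived in Example 7.10."""
    return math.sqrt((tau ** 2 + beta ** 2 * lam ** 2) / (sigma ** 2 + lam ** 2))


def simulate_root(n, beta=2.0, sigma=1.0, tau=1.0, lam=1.0, seed=0):
    """Draw one sample with every xi_i equal to lam and return the root."""
    rng = random.Random(seed)
    xs = [rng.gauss(lam, sigma) for _ in range(n)]
    ys = [rng.gauss(beta * lam, tau) for _ in range(n)]
    return beta_root(xs, ys)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, round(simulate_root(n), 4))
    print("limit of the root:", round(root_limit(2.0, 1.0, 1.0, 1.0), 4))
    print("true beta: 2.0")
```

With these illustrative values the root settles near $\sqrt{2.5} \approx 1.58$ rather than near the true $\beta = 2$, in line with the limit computed above.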

If in (7.21) it is assumed that $\tau = \sigma$, it is easily seen that the MLE of $\beta$ is
consistent (Problem 7.18). For a discussion of this problem and some of its
generalizations, see Anderson 1976, Gleser 1981, and Anderson and Sawa 1982. Another
modification leading to a consistent MLE is suggested by Copas (1972a).

Instead of (7.21), it is sometimes assumed that the $\xi$'s are themselves iid
according to a normal distribution $N(\mu, \gamma^2)$. The pairs $(X_i, Y_i)$ then constitute a
sample from a bivariate normal distribution, and asymptotically efficient estimators
of the parameters $\mu$, $\gamma$, $\beta$, $\sigma$, and $\tau$ can be obtained from the MLEs of
Example 6.4. An analogous treatment is possible for Example 7.9. ||

Kiefer and Wolfowitz (1956) have considered not only this problem and that
of Example 7.9, but a large class of problems of this type by postulating that
the $\xi$'s are iid according to a distribution G, but treating G as unknown, subject
only to some rather general regularity assumptions. Alternative approaches to the
estimation of structural parameters in the presence of a large number of incidental
parameters are discussed by Andersen (1970b) and Kalbfleisch and Sprott (1970).
A discussion of Example 7.10 and its extension to more general regression models
can be found in Stuart and Ord (1991, Chapters 26 and 28), and of Example 7.9
in Jewell and Raab (1981).

A review of these models, also known as measurement error models, is given 
by Gleser (1991) and is the topic of the book by Carroll, Ruppert, and Stefanski 
(1995). 

Another extension of likelihood estimation leads us along the lines of Example 
4.5, in which it was seen that an estimator such as the sample median, which 
was not the MLE, was a desirable alternative. Such situations can lead naturally 
to replacing the likelihood function by another function, often with the goal of 
obtaining a robust estimator. 





Such an approach was suggested by Huber (1964), resulting in a compromise
between the mean and the median. The mean and the median minimize, respectively,
$\sum (x_i - a)^2$ and $\sum |x_i - a|$. Huber suggested minimizing instead

(7.23) $\sum_{i=1}^{n} \rho(x_i - a)$

where $\rho$ is given by

(7.24) $\rho(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le k \\ k|x| - \frac{1}{2}k^2 & \text{if } |x| \ge k. \end{cases}$

This function is proportional to $x^2$ for $|x| \le k$, but outside this interval, it replaces
the parabolic arcs by straight lines. The pieces fit together so that $\rho$ and its derivative
$\rho'$ are continuous (Problem 7.22). As k gets larger, $\rho$ will agree with $\frac{1}{2}x^2$ over most
of its range, so that the estimator comes close to the mean. As k gets smaller, the
estimator will become close to the median. As a moderate compromise, the value
k = 1.5 is sometimes suggested.
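
The compromise between mean and median can be illustrated with a short computation. This is a sketch under stated assumptions: the estimator is found by bisection on the derivative of Huber's $\rho$ (the clipped identity), which is one convenient way to minimize (7.23) and is not taken from the text; the function names and the sample are hypothetical.

```python
def huber_rho(x, k):
    """Huber's rho from (7.24): quadratic near zero, linear in the tails."""
    return 0.5 * x * x if abs(x) <= k else k * abs(x) - 0.5 * k * k


def huber_psi(x, k):
    """Derivative of rho: x clipped to the interval [-k, k]."""
    return max(-k, min(k, x))


def huber_location(xs, k=1.5, tol=1e-10):
    """Solve sum(psi(x_i - a)) = 0 for a by bisection.

    The score is nonincreasing in a, nonnegative at min(xs), and
    nonpositive at max(xs), so a root lies in [min(xs), max(xs)].
    """
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(huber_psi(x - mid, k) for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    sample = [0.0, 1.0, 2.0, 3.0, 100.0]   # one gross outlier
    print("k = 1000 :", huber_location(sample, k=1000.0))  # close to the mean, 21.2
    print("k = 1.5  :", huber_location(sample, k=1.5))     # close to 2
    print("k = 1e-6 :", huber_location(sample, k=1e-6))    # close to the median, 2
```

For this sample the outlier pulls the mean to 21.2, while the Huber estimate with k = 1.5 stays at the bulk of the data.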

The Huber estimators minimizing (7.23) with $\rho$ given by (7.24) are a subset of
the class of M-estimators obtained by minimizing (7.23) for arbitrary $\rho$. If $\rho$ is
convex and even, as is the case for (7.24), it follows from Theorem 1.7.15 that the
minimizing values of (7.23) constitute a closed interval; if $\rho$ is strictly convex, the
minimizing value is unique. If $\rho$ has a derivative $\rho' = \psi$, the M-estimators $M_n$
may be defined as the solutions of the equation

(7.25) $\sum_{i=1}^{n} \psi(x_i - a) = 0$.

If $X_1, \ldots, X_n$ are iid according to $F(x - \theta)$ where F is symmetric about zero
and has density f, it turns out under weak assumptions on $\psi$ and F that

(7.26) $\sqrt{n}(M_n - \theta) \xrightarrow{\mathcal{L}} N[0, \sigma^2(F, \psi)]$

where

(7.27) $\sigma^2(F, \psi) = \frac{\int \psi^2(x) f(x)\,dx}{\left[\int \psi'(x) f(x)\,dx\right]^2}$,

provided both numerator and denominator on the right side are finite and the
denominator is positive.
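
As an illustration of (7.27), the following sketch evaluates the asymptotic variance for the Huber $\psi$ (the identity clipped to $[-k, k]$) when F is standard normal. The closed-form integrals used in the code are derived here by integration by parts and are not given in the text; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def huber_asymptotic_variance(k):
    """sigma^2(F, psi) of (7.27) for F = N(0, 1) and psi(x) = clip(x, -k, k)."""
    # Integral of psi'(x) f(x) dx = P(|X| <= k), since psi' is 1 inside and 0 outside.
    central = 2.0 * normal_cdf(k) - 1.0
    # Integral of x^2 phi(x) over [-k, k], by integration by parts.
    inside = central - 2.0 * k * normal_pdf(k)
    # Outside [-k, k], psi^2 = k^2.
    numerator = inside + k * k * (1.0 - central)
    return numerator / central ** 2


if __name__ == "__main__":
    for k in (1e-3, 0.5, 1.0, 1.5, 2.0, 10.0):
        print(f"k = {k:6}: sigma^2 = {huber_asymptotic_variance(k):.4f}")
```

Under these assumptions the value moves from about $\pi/2$ (the asymptotic variance of the median at the normal) for very small k toward 1 (that of the mean) for large k; at k = 1.5 it is roughly 1.04.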

Proofs of (7.26) can be found in Huber (1981), in which a detailed account of the 
theory of M-estimators is given not only for location parameters, but also in more 
general settings. See also Serfling 1980, Chapter 7, Hampel et al. 1986, Staudte 
and Sheather 1990, as well as Problems 7.24-7.26. 

For

(7.28) $\rho(x) = -\log f(x)$,

minimizing (7.23) is equivalent to maximizing $\prod f(x_i - a)$, and the M-estimator
then coincides with the maximum likelihood estimator. In particular, for known
F, the M-estimator of $\theta$ corresponding to (7.28) satisfies (7.26) with $\sigma^2 = 1/I_f$
(see Theorem 3.10). Further generalizations are discussed in Note 10.4.

The results of this chapter have all been derived in the so-called regular case,
that is, when the densities satisfy regularity assumptions such as those of Theorems
2.6, 3.10, and 5.1. Of particular importance for the validity of the conclusions is
that the support of the distributions $P_\theta$ does not vary with $\theta$. Varying support brings
with it information that often makes it possible to estimate some of the parameters
with greater accuracy than that attainable in the regular case.

Example 7.11 Uniform MLE. Let $X_1, \ldots, X_n$ be iid as $U(0, \theta)$. Then, the MLE
of $\theta$ is $\hat\theta_n = X_{(n)}$ and satisfies (Problem 2.6)

(7.29) $n(\theta - \hat\theta_n) \xrightarrow{\mathcal{L}} E(0, \theta)$.

Since $\hat\theta_n$ always underestimates $\theta$ and has a bias of order $1/n$, the same order
as the error $\hat\theta_n - \theta$ itself, consider as an alternative the UMVU estimator
$\delta_n = [(n + 1)/n]X_{(n)}$, which satisfies

(7.30) $n(\theta - \delta_n) \xrightarrow{\mathcal{L}} E(-\theta, \theta)$.

The two asymptotic distributions have the same variance, but the first has
expectation $\theta$, whereas the second is asymptotically unbiased with expectation zero and
is thus much better centered.

The improvement of $\delta_n$ over $\hat\theta_n$ is perhaps seen more clearly by considering
expected squared error. We have (Problem 2.7)

(7.31) $E[n(\hat\theta_n - \theta)]^2 \to 2\theta^2$, $\quad E[n(\delta_n - \theta)]^2 \to \theta^2$.

Thus, the risk efficiency of $\hat\theta_n$ with respect to $\delta_n$ is 1/2. ||
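
The limits in (7.31) can be checked both exactly and by simulation. The sketch below is illustrative only: the finite-n formulas are derived here from the fact that $X_{(n)}/\theta$ has a Beta(n, 1) distribution (they are not quoted from the text), and the function names, seed, and sample sizes are arbitrary choices.

```python
import random
from fractions import Fraction


def exact_scaled_risks(n, theta=1):
    """n^2 times the expected squared error of X_(n) and of (n+1)/n * X_(n)."""
    theta = Fraction(theta)
    mle = Fraction(2 * n * n, (n + 1) * (n + 2)) * theta ** 2
    umvu = Fraction(n * n, n * (n + 2)) * theta ** 2
    return mle, umvu


def simulated_scaled_risks(n, theta=1.0, reps=20_000, seed=0):
    """Monte Carlo versions of the same two quantities."""
    rng = random.Random(seed)
    mle_sq = umvu_sq = 0.0
    for _ in range(reps):
        # The maximum of n iid U(0, theta) draws is distributed as theta * U**(1/n).
        x_max = theta * rng.random() ** (1.0 / n)
        mle_sq += (n * (x_max - theta)) ** 2
        umvu_sq += (n * ((n + 1) / n * x_max - theta)) ** 2
    return mle_sq / reps, umvu_sq / reps


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        mle, umvu = exact_scaled_risks(n)
        print(n, float(mle), float(umvu), float(umvu / mle))
    print(simulated_scaled_risks(50))
```

The exact ratio of the two risks works out to $(n + 1)/(2n)$, which tends to the risk efficiency 1/2 stated above.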

The example illustrates two ways in which such situations differ from the regular
iid cases. First, the appropriate normalizing factor is $n$ rather than $\sqrt{n}$, reflecting
the fact that the error of the MLE is of order $1/n$ instead of $1/\sqrt{n}$. Second, the
MLE need no longer be asymptotically optimal even when it is consistent.

Example 7.12 Exponential MLE. Let $X_1, \ldots, X_n$ be iid according to the
exponential distribution $E(\xi, b)$. Then, the MLEs of $\xi$ and $b$ are

(7.32) $\hat\xi = X_{(1)}$ and $\hat b = \frac{1}{n} \sum [X_i - X_{(1)}]$.

It follows from Problem 1.6.18 that $n[X_{(1)} - \xi]/b$ is exactly (and hence
asymptotically) distributed as $E(0, 1)$. As was the case for $\hat\theta$ in the preceding example,
$\hat\xi$ is therefore asymptotically biased. More satisfactory is the UMVU estimator $\delta_n$
given by (2.2.23), which is obtained from $\hat\xi$ by subtracting an estimator of the bias
(Problem 7.27).

It was further seen in Problem 1.6.18 that $2n\hat b/b$ is distributed as $\chi^2_{2n-2}$. Since
$(\chi^2_n - n)/\sqrt{2n} \to N(0, 1)$ in law, it is seen that $\sqrt{n}(\hat b - b) \to N(0, b^2)$. We shall
now show that $\hat b$ is asymptotically efficient. For this purpose, consider the case that
$\xi$ is known. The resulting one-parameter family of the X's is an exponential family
and the MLE $\hat{\hat b}$ of $b$ is asymptotically efficient and satisfies $\sqrt{n}(\hat{\hat b} - b) \to N(0, b^2)$
(Problem 7.27). Since $\hat b$ and $\hat{\hat b}$ have the same asymptotic distribution, $\hat b$ is a fortiori
also asymptotically efficient, as was to be proved. ||





Example 7.13 Pareto MLE. Let $X_1, \ldots, X_n$ be iid according to the Pareto
distribution $P(a, c)$ with density

(7.33) $f(x) = a c^a / x^{a+1}$, $\quad 0 < c < x$, $\quad 0 < a$.

The distribution is widely used, for example, in economics (see Johnson, Kotz,
and Balakrishnan 1994, Chapter 20) and is closely connected with the exponential
distribution of the preceding example through the fact that if X has density (7.33),
then $Y = \log X$ has the exponential distribution $E(\xi, b)$ with (Problem 1.5.25)

(7.34) $\xi = \log c$, $\quad b = 1/a$.

From this fact, it is seen that the MLEs of $a$ and $c$ are

(7.35) $\hat a = \left[ \frac{1}{n} \sum \log \frac{X_i}{X_{(1)}} \right]^{-1}$ and $\hat c = X_{(1)}$

and that these estimators are independently distributed, $\hat c$ as $P(na, c)$ and $2na/\hat a$
as $\chi^2_{2n-2}$ (Problem 7.29).

Since $\hat b$ is asymptotically efficient in the exponential case, the same is true of $1/\hat b$
and hence of $\hat a$. On the other hand, $n(X_{(1)} - c)$ has the limit distribution $E(0, c/a)$
and hence is biased. As was the case with the MLE of $\xi$ in Example 7.12, an
improvement over the MLE $\hat c$ of $c$ is obtained by removing its bias and replacing
$\hat c$ by the UMVU estimator

(7.36) $\delta_n = X_{(1)} \left[ 1 - \frac{1}{(n-1)\hat a} \right]$.

For the details of these calculations, see Problems 7.29-7.31.
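
The bias of the MLE of $c$ and its removal can be illustrated by simulation. This is a sketch under stated assumptions: it takes the MLEs to be $\hat a = n / \sum \log(X_i / X_{(1)})$ and $\hat c = X_{(1)}$ (obtained by transforming the exponential MLEs of Example 7.12), and the bias-corrected estimator to be $X_{(1)}[1 - 1/((n-1)\hat a)]$, which is unbiased given the independence and distributional facts quoted above. Function names, seed, and parameter values are hypothetical.

```python
import math
import random


def pareto_mles(xs):
    """MLEs (a_hat, c_hat) for the Pareto P(a, c) model, as assumed in the lead-in."""
    n = len(xs)
    c_hat = min(xs)
    a_hat = n / sum(math.log(x / c_hat) for x in xs)
    return a_hat, c_hat


def pareto_bias_corrected_c(xs):
    """Bias-corrected estimator of c: X_(1) * [1 - 1 / ((n - 1) * a_hat)]."""
    n = len(xs)
    a_hat, c_hat = pareto_mles(xs)
    return c_hat * (1.0 - 1.0 / ((n - 1) * a_hat))


def average_estimates(n=20, a=2.0, c=1.0, reps=20_000, seed=0):
    """Average the MLE of c and the corrected estimator over repeated samples."""
    rng = random.Random(seed)
    mle_sum = corrected_sum = 0.0
    for _ in range(reps):
        # random.paretovariate(a) has density a / x**(a + 1) on x >= 1.
        xs = [c * rng.paretovariate(a) for _ in range(n)]
        mle_sum += min(xs)
        corrected_sum += pareto_bias_corrected_c(xs)
    return mle_sum / reps, corrected_sum / reps


if __name__ == "__main__":
    mle_avg, corrected_avg = average_estimates()
    print("average MLE of c:        ", round(mle_avg, 4))
    print("average corrected value: ", round(corrected_avg, 4))
```

With n = 20, a = 2, c = 1 the MLE averages near $na\,c/(na - 1) = 40/39$, while the corrected estimator averages near the true value 1.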


Example 7.14 Lognormal MLE. As a last situation with variable support,
consider a sample $X_1, \ldots, X_n$ from a three-parameter lognormal distribution, defined
by the requirement that $Z_i = \log(X_i - \xi)$ are iid as $N(\gamma, \sigma^2)$, so that

(7.37) $f(x; \xi, \gamma, \sigma^2) = \frac{1}{(x - \xi)\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} [\log(x - \xi) - \gamma]^2 \right\}$

when $x > \xi$, and $f = 0$ otherwise. When $\xi$ is known, the problem reduces to that
of estimating the mean $\gamma$ and variance $\sigma^2$ from the normal sample $Z_1, \ldots, Z_n$.
However, when $\xi$ is unknown, the support varies with $\xi$. Although in this case the
density (7.37) tends to zero very smoothly at $\xi$ (Problem 7.34), the theory of Section
6.5 is not applicable, and the problem requires a more powerful approach such as
that of Le Cam (1969). [For a discussion of the literature on this problem, see, for
example, Johnson, Kotz, and Balakrishnan 1994, Chapter 14. A comprehensive
treatment of the lognormal distribution is given in Crow and Shimizu (1988).]
The difficulty can be circumvented by a device used in other contexts by
Kempthorne (1966), Lambert (1970), and Copas (1972a), and suggested for the present
problem by Giesbrecht and Kempthorne (1976). These authors argue that
observations are never recorded exactly but only to the nearest unit of measurement. This
formulation leads to a multinomial model of the kind considered for one parameter
in Example 4.6, and Theorem 5.1 is directly applicable.

The corresponding problem for the three-parameter Weibull distribution is
reviewed in Scholz (1985). For further discussion of such irregular cases, see, for
example, Polfeldt 1970 and Woodroofe 1972. ||


Although the MLE, or bias-corrected MLE, may achieve the smallest asymptotic 
variance, it may not minimize mean squared error when compared with all other 
estimators. This is illustrated by the following example in which, for the sake of 
simplicity, we shall consider expected squared error instead of asymptotic variance. 


Example 7.15 Second-order mean squared error. Consider the estimation of
$\sigma^2$ on the basis of a sample $X_1, \ldots, X_n$ from $N(0, \sigma^2)$. The MLE is then

$\hat\sigma^2 = \frac{1}{n} \sum X_i^2$,

which happens to be unbiased, so that no correction is needed. Let us now consider
the more general class of estimators

(7.38) $\delta_n = \left( \frac{1}{n} + \frac{a}{n^2} \right) \sum X_i^2$.

It can be shown (Problem 7.32) that

(7.39) $E(\delta_n - \sigma^2)^2 = \frac{2\sigma^4}{n} + \frac{(4a + a^2)\sigma^4}{n^2} + O\left( \frac{1}{n^3} \right)$.

Thus, the estimators $\delta_n$ are all asymptotically efficient, that is, $nE(\delta_n - \theta)^2 \to
1/I(\theta)$ where $\theta = \sigma^2$. However, the MLE does not minimize the error in this class
since the term of order $1/n^2$ is minimized not by $a = 0$ (MLE) but by $a = -2$, so
that $(1/n - 2/n^2)\sum X_i^2$ has higher second-order efficiency than the MLE. In fact,
the normalized limiting risk difference between the MLE ($a = 0$) relative to $\delta_n$
with $a = -2$ is 2, that is, the limiting risk of the MLE is larger (Problem 7.32). ||
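
The expansion (7.39) can be verified with exact rational arithmetic. The sketch below is not from the text: it uses only the facts that $\sum X_i^2/\sigma^2$ has a $\chi^2_n$ distribution with mean $n$ and variance $2n$, and the function names are hypothetical. In this calculation the $O(1/n^3)$ remainder turns out to be exactly $2a^2\sigma^4/n^3$.

```python
from fractions import Fraction


def scaled_mse(n, a):
    """E(delta_n - sigma^2)^2 / sigma^4 for delta_n = (1/n + a/n^2) * sum(X_i^2)."""
    c = Fraction(1, n) + Fraction(a) / n ** 2
    variance = 2 * n * c * c   # var(c * sum X_i^2) / sigma^4, since var(chi^2_n) = 2n
    bias = c * n - 1           # (E[delta_n] - sigma^2) / sigma^2
    return variance + bias * bias


def expansion(n, a):
    """Right side of (7.39) divided by sigma^4, with the remainder written out."""
    a = Fraction(a)
    return Fraction(2, n) + (4 * a + a * a) / n ** 2 + 2 * a * a / n ** 3


if __name__ == "__main__":
    n = 50
    for a in (-4, -2, 0, 1):
        print(a, float(scaled_mse(n, a)), scaled_mse(n, a) == expansion(n, a))
    gap = n ** 2 * (scaled_mse(n, 0) - scaled_mse(n, -2))
    print("n^2 * (MSE at a=0 minus MSE at a=-2) / sigma^4 =", float(gap))
```

The scaled gap equals $4 - 8/n$; dividing its limit by the first-order coefficient 2 gives the value 2, which appears to be the normalization behind the risk difference quoted in the example (this reading is an inference, not a statement from the text).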


A uniformly best estimator (up to second-order terms) typically will not
exist. The second-order situation is thus similar to that encountered in the exact
(small-sample) theory. One can obtain uniform second-order optimality by
imposing restrictions such as first-order unbiasedness, or must be content with weaker
properties such as second-order admissibility or minimaxity. An admissibility
result (somewhat similar to Theorem 5.2.14) is given by Ghosh and Sinha (1981);
the minimax problem is treated by Levit (1980).


8 Asymptotic Efficiency of Bayes Estimators 

Bayes estimators were defined in Section 4.1, and many of their properties were 
illustrated throughout Chapter 4. We shall now consider their asymptotic behavior. 

Example 8.1 Limiting binomial. If X has the binomial distribution $b(p, n)$ and
the loss is squared error, it was seen in Example 4.1.5 that the Bayes estimator of
$p$ corresponding to the beta prior $B(a, b)$ is

$\delta_n(X) = (a + X)/(a + b + n)$.

Thus,

$\sqrt{n}[\delta_n(X) - p] = \sqrt{n}\left( \frac{X}{n} - p \right) + \frac{\sqrt{n}}{a + b + n} \left[ a - (a + b)\frac{X}{n} \right]$

and it follows from Theorem 1.8.10 that $\sqrt{n}[\delta_n(X) - p]$ has the same limit
distribution as $\sqrt{n}[X/n - p]$, namely the normal distribution $N[0, p(1 - p)]$. So, the
Bayes estimator of the success probability $p$, suitably normalized, has a normal
limit distribution which is independent of the parameters of the prior distribution
and is the same as that of the MLE $X/n$. Therefore, these Bayes estimators are
asymptotically efficient. (See Problem 8.1 for analogous results.) ||
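
The vanishing influence of the prior can be made concrete. The sketch below, with hypothetical function names and an arbitrary prior $B(2, 5)$, computes the largest possible value of $\sqrt{n}\,|\delta_n(x) - x/n|$ over all outcomes $x$; by the algebra above this equals $\sqrt{n}\max(a, b)/(a + b + n)$, which is of order $1/\sqrt{n}$.

```python
import math


def bayes_estimate(x, n, a, b):
    """Posterior mean of p under a Beta(a, b) prior after x successes in n trials."""
    return (a + x) / (a + b + n)


def max_scaled_gap(n, a, b):
    """Largest value over x = 0, ..., n of sqrt(n) * |delta_n(x) - x/n|."""
    return max(
        math.sqrt(n) * abs(bayes_estimate(x, n, a, b) - x / n)
        for x in range(n + 1)
    )


if __name__ == "__main__":
    a, b = 2.0, 5.0
    for n in (10, 100, 1_000, 10_000):
        bound = math.sqrt(n) * max(a, b) / (a + b + n)
        print(n, round(max_scaled_gap(n, a, b), 5), round(bound, 5))
```

Since the gap is bounded uniformly in $x$ and tends to zero, the normalized Bayes estimator and the normalized MLE must share the same limit distribution.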

This example raises the question of whether the same limit distribution also
obtains when the conjugate priors in this example are replaced by more general
prior distributions, and whether the phenomenon persists in more general
situations. The principal result of the present section (Theorem 8.3) shows that, under
suitable conditions, the distribution of Bayes estimators based on n iid random
variables tends to become independent of the prior distribution as $n \to \infty$ and
that the Bayes estimators are asymptotically efficient.

Versions of such a theorem were given by Bickel and Yahav (1969) and by
Ibragimov and Has'minskii (1972, 1981). The present proof, which combines elements
from these papers, is due to Bickel. We begin by stating some assumptions.

Let $X_1, \ldots, X_n$ be iid with density $f(x_i|\theta)$ (with respect to $\mu$), where $\theta$ is
real-valued and the parameter space $\Omega$ is an open interval. The true value of $\theta$ will be
denoted by $\theta_0$.

(B1) The log likelihood function $l(\theta)$ satisfies the assumptions of Theorem 2.6.

To motivate the next assumption, note that under the assumptions of Theorem
2.6, if $\tilde\theta = \tilde\theta_n$ is any sequence for which $\tilde\theta \xrightarrow{P} \theta_0$, then

(8.1) $l(\tilde\theta) = l(\theta_0) + (\tilde\theta - \theta_0)l'(\theta_0) - \frac{1}{2}(\tilde\theta - \theta_0)^2 [nI(\theta_0) + R_n(\tilde\theta)]$

where

(8.2) $\frac{1}{n} R_n(\tilde\theta) \xrightarrow{P} 0$ as $n \to \infty$

(Problem 8.3). We require here the following stronger assumption.

(B2) Given any $\varepsilon > 0$, there exists $\delta > 0$ such that in the expansion (8.1), the probability
of the event

(8.3) $\sup\left\{ \left| \frac{1}{n} R_n(\theta) \right| : |\theta - \theta_0| \le \delta \right\} \ge \varepsilon$

tends to zero as $n \to \infty$.

In the present case it is not enough to impose conditions on $l(\theta)$ in the
neighborhood of $\theta_0$, as is typically the case in asymptotic results. Since the Bayes estimators
involve integration over the whole range of $\theta$ values, it is also necessary to control
the behavior of $l(\theta)$ at a distance from $\theta_0$.

(B3) For any $\delta > 0$, there exists $\varepsilon > 0$ such that the probability of the event

(8.4) $\sup\left\{ \frac{1}{n}[l(\theta) - l(\theta_0)] : |\theta - \theta_0| \ge \delta \right\} \le -\varepsilon$

tends to 1 as $n \to \infty$.

(B4) The prior density $\pi$ of $\theta$ is continuous and positive for all $\theta \in \Omega$.

(B5) The expectation of $\theta$ under $\pi$ exists, that is,

(8.5) $\int |\theta| \pi(\theta)\,d\theta < \infty$.

To establish the asymptotic efficiency of Bayes estimators under these
assumptions, we shall first prove that for large values of n, the posterior distribution of $\theta$
given the X's is approximately normal with

(8.6) mean $= \theta_0 + \frac{1}{nI(\theta_0)} l'(\theta_0)$ and variance $= 1/nI(\theta_0)$.

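Theorem 8.2 If $\pi^*(t|x)$ is the posterior density of $\sqrt{n}(\theta - T_n)$ where

(8.7) $T_n = \theta_0 + \frac{1}{nI(\theta_0)} l'(\theta_0)$,

then

(i) if (B1)-(B4) hold,

(8.8) $\int \left| \pi^*(t|x) - \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] \right| dt \xrightarrow{P} 0$;

(ii) if, in addition, (B5) holds, then

(8.9) $\int (1 + |t|) \left| \pi^*(t|x) - \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] \right| dt \xrightarrow{P} 0$.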

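Proof. (i) By the definition of $T_n$,

(8.10) $\pi^*(t|x) = \frac{\pi\left( T_n + \frac{t}{\sqrt{n}} \right) \exp\left[ l\left( T_n + \frac{t}{\sqrt{n}} \right) \right]}{\int \pi\left( T_n + \frac{u}{\sqrt{n}} \right) \exp\left[ l\left( T_n + \frac{u}{\sqrt{n}} \right) \right] du} = e^{\omega(t)} \pi\left( T_n + \frac{t}{\sqrt{n}} \right) / C_n$

where

(8.11) $\omega(t) = l\left( T_n + \frac{t}{\sqrt{n}} \right) - l(\theta_0) - \frac{1}{2nI(\theta_0)} [l'(\theta_0)]^2$

and

(8.12) $C_n = \int e^{\omega(u)} \pi\left( T_n + \frac{u}{\sqrt{n}} \right) du$.

We shall prove at the end of the section that

(8.13) $J_1 = \int \left| e^{\omega(t)} \pi\left( T_n + \frac{t}{\sqrt{n}} \right) - e^{-t^2 I(\theta_0)/2} \pi(\theta_0) \right| dt \xrightarrow{P} 0$,

so that

(8.14) $C_n \xrightarrow{P} \int e^{-t^2 I(\theta_0)/2} \pi(\theta_0)\,dt = \pi(\theta_0)\sqrt{2\pi/I(\theta_0)}$.

The left side of (8.8) is equal to $J/C_n$, where

(8.15) $J = \int \left| e^{\omega(t)} \pi\left( T_n + \frac{t}{\sqrt{n}} \right) - C_n \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] \right| dt$

and, by (8.14), it is enough to show that $J \xrightarrow{P} 0$.
Now, $J \le J_1 + J_2$ where $J_1$ is given by (8.13) and

$J_2 = \int \left| C_n \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] - \exp\left[ -\frac{t^2}{2} I(\theta_0) \right] \pi(\theta_0) \right| dt$

$\quad = \left| \frac{C_n \sqrt{I(\theta_0)}}{\sqrt{2\pi}} - \pi(\theta_0) \right| \int \exp\left[ -\frac{t^2}{2} I(\theta_0) \right] dt$.

By (8.13) and (8.14), $J_1$ and $J_2$ tend to zero in probability, and this completes the
proof of part (i).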

(ii) The left side of (8.9) is equal to

$\frac{1}{C_n} J' \le \frac{1}{C_n} (J_1' + J_2')$

where $J'$, $J_1'$, and $J_2'$ are obtained from $J$, $J_1$, and $J_2$, respectively, by inserting the
factor $(1 + |t|)$ under the integral signs. It is therefore enough to prove that $J_1'$ and
$J_2'$ both tend to zero in probability. The proof for $J_2'$ is the same as that for $J_2$; the
proof for $J_1'$ will be given at the end of the section, together with that for $J_1$. □
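
The convergence (8.8) can be examined numerically in a simple case. The sketch below is an illustration under stated assumptions, not part of the proof: the data are Bernoulli with true value `p0`, the prior is Beta(a, b), the number of successes is fixed at `round(n * p0)` instead of being random, and the integral is approximated by a midpoint rule truncated at eight standard deviations. In this model $T_n$ of (8.7) reduces to $x/n$ and $I(p_0) = 1/[p_0(1 - p_0)]$. All names are hypothetical.

```python
import math


def beta_pdf(theta, alpha, beta):
    """Density of the Beta(alpha, beta) distribution (zero outside (0, 1))."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    log_pdf = (
        math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
        + (alpha - 1.0) * math.log(theta)
        + (beta - 1.0) * math.log(1.0 - theta)
    )
    return math.exp(log_pdf)


def l1_distance(n, p0=0.3, a=2.0, b=5.0, grid=4000):
    """Approximate the left side of (8.8) for x = round(n * p0) successes."""
    x = round(n * p0)
    info = 1.0 / (p0 * (1.0 - p0))      # Fisher information I(p0)
    t_n = x / n                          # T_n of (8.7) in the Bernoulli model
    sd = 1.0 / math.sqrt(info)           # standard deviation of the normal limit
    half_width = 8.0 * sd
    step = 2.0 * half_width / grid
    total = 0.0
    for i in range(grid):
        t = -half_width + (i + 0.5) * step                  # midpoint rule
        theta = t_n + t / math.sqrt(n)
        # Posterior density of t = sqrt(n) * (theta - T_n), by change of variable.
        posterior_t = beta_pdf(theta, a + x, b + n - x) / math.sqrt(n)
        normal_t = math.exp(-0.5 * (t / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += abs(posterior_t - normal_t) * step
    return total


if __name__ == "__main__":
    for n in (50, 200, 1000, 5000):
        print(n, round(l1_distance(n), 4))
```

Under these assumptions the approximate $L_1$ distance shrinks as n grows, as part (i) of the theorem predicts.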

On the basis of Theorem 8.2, we are now able to prove the principal result of 
this section. 


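Theorem 8.3 If (B1)-(B5) hold, and if $\tilde\theta_n$ is the Bayes estimator when the prior
density is $\pi$ and the loss is squared error, then

(8.16) $\sqrt{n}(\tilde\theta_n - \theta_0) \xrightarrow{\mathcal{L}} N[0, 1/I(\theta_0)]$,

so that $\tilde\theta_n$ is consistent 7 and asymptotically efficient.

7 A general relationship between the consistency of MLEs and Bayes estimators is discussed by
Strasser (1981).

Proof. We have

$\sqrt{n}(\tilde\theta_n - \theta_0) = \sqrt{n}(\tilde\theta_n - T_n) + \sqrt{n}(T_n - \theta_0)$.

By the CLT, the second term has the limit distribution $N[0, 1/I(\theta_0)]$, so that it only
remains to show that

(8.17) $\sqrt{n}(\tilde\theta_n - T_n) \xrightarrow{P} 0$.

Note that Equation (8.10) says that $\pi^*(t|x) = \frac{1}{\sqrt{n}} \pi\left( T_n + \frac{t}{\sqrt{n}} \,\middle|\, x \right)$, and, hence, by a
change of variable, we have

$\tilde\theta_n = \int \theta\,\pi(\theta|x)\,d\theta = \int \left( \frac{t}{\sqrt{n}} + T_n \right) \pi^*(t|x)\,dt = \frac{1}{\sqrt{n}} \int t\,\pi^*(t|x)\,dt + T_n$

and hence

$\sqrt{n}(\tilde\theta_n - T_n) = \int t\,\pi^*(t|x)\,dt$.

Now, since $\int t \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] dt = 0$,

$\sqrt{n}\,|\tilde\theta_n - T_n| = \left| \int t\,\pi^*(t|x)\,dt - \int t \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] dt \right|$

$\quad \le \int |t| \left| \pi^*(t|x) - \sqrt{I(\theta_0)}\,\phi\left[ t\sqrt{I(\theta_0)} \right] \right| dt$,

which tends to zero in probability by Theorem 8.2. □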

Before discussing the implications of Theorem 8.3, we shall show that assump¬ 
tions (B1)-(B5) are satisfied in exponential families. 

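Example 8.4 Exponential families. Let

$f(x_i|\theta) = e^{\theta T(x_i) - A(\theta)}$,

so that

$A(\theta) = \log \int e^{\theta T(x)}\,d\mu(x)$.

Recall from Section 1.5 that A is differentiable to all orders and that

$A'(\theta) = E_\theta[T(X)]$,

$A''(\theta) = \mathrm{var}_\theta[T(X)] = I(\theta)$.

Suppose $I(\theta) > 0$. Then,

(8.18) $l(\theta) - l(\theta_0) = (\theta - \theta_0) \sum T(X_i) - n[A(\theta) - A(\theta_0)]$

$\quad = (\theta - \theta_0) \sum [T(X_i) - A'(\theta_0)] - n\{[A(\theta) - A(\theta_0)] - (\theta - \theta_0)A'(\theta_0)\}$.

The first term is equal to $(\theta - \theta_0)l'(\theta_0)$. Apply Taylor's theorem to $A(\theta)$ to find

$A(\theta) = A(\theta_0) + (\theta - \theta_0)A'(\theta_0) + \frac{1}{2}(\theta - \theta_0)^2 A''(\theta^*)$,

so that the second term in (8.18) is equal to $(-n/2)(\theta - \theta_0)^2 A''(\theta^*)$. Hence,

$l(\theta) - l(\theta_0) = (\theta - \theta_0)l'(\theta_0) - \frac{n}{2}(\theta - \theta_0)^2 A''(\theta^*)$.

To prove (B2), we must show that

$A''(\theta^*) = I(\theta_0) + \frac{1}{n} R_n(\theta)$

where

$R_n(\theta) = n[A''(\theta^*) - I(\theta_0)]$

satisfies (8.3); that is, we must show that given $\varepsilon$, there exists $\delta$ such that the
probability of

$\sup\{ |A''(\theta^*) - I(\theta_0)| : |\theta - \theta_0| \le \delta \} > \varepsilon$

tends to zero. This follows from the facts that $I(\theta) = A''(\theta)$ is continuous and that
$\theta^* \to \theta_0$ as $\theta \to \theta_0$.

To see that (B3) holds, write

$\frac{1}{n}[l(\theta) - l(\theta_0)] = (\theta - \theta_0) \left\{ \frac{1}{n} \sum [T(X_i) - A'(\theta_0)] - \left[ \frac{A(\theta) - A(\theta_0)}{\theta - \theta_0} - A'(\theta_0) \right] \right\}$

and suppose without loss of generality that $\theta > \theta_0$.

Since $A''(\theta) > 0$, so that $A(\theta)$ is strictly convex, it is seen that $\theta > \theta_0$ implies
$[A(\theta) - A(\theta_0)]/(\theta - \theta_0) > A'(\theta_0)$. On the other hand, $\sum [T(X_i) - A'(\theta_0)]/n \xrightarrow{P} 0$
and hence with probability tending to 1, the factor of $(\theta - \theta_0)$ is negative. It follows
that

$\sup\left\{ \frac{1}{n}[l(\theta) - l(\theta_0)] : \theta - \theta_0 \ge \delta \right\}$

$\quad \le \delta \left\{ \frac{1}{n} \sum [T(X_i) - A'(\theta_0)] - \inf\left[ \frac{A(\theta) - A(\theta_0)}{\theta - \theta_0} - A'(\theta_0) : \theta - \theta_0 \ge \delta \right] \right\}$

and hence that (B3) is satisfied. ||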


Theorems 8.2 and 8.3 were stated under the assumption that $\pi$ is the density of
a proper distribution, so that its integral is equal to 1. There is a trivial but useful
extension to the case in which $\int \pi(\theta)\,d\theta = \infty$ but where there exists $n_0$, so that
the posterior density

$\tilde\pi(\theta|x_1, \ldots, x_{n_0}) = \frac{\prod_{i=1}^{n_0} f(x_i|\theta)\,\pi(\theta)}{\int \prod_{i=1}^{n_0} f(x_i|\theta)\,\pi(\theta)\,d\theta}$

of $\theta$ given $x_1, \ldots, x_{n_0}$ is, with probability 1, a proper density satisfying assumptions
(B4) and (B5). The posterior density of $\theta$ given $X_1, \ldots, X_n$ ($n > n_0$) when $\theta$ has
prior density $\pi$ is then the same as the posterior density of $\theta$ given $X_{n_0+1}, \ldots, X_n$
when $\theta$ has prior density $\tilde\pi$, and the result now follows.


Example 8.5 Location families. The Pitman estimator derived in Theorem 3.1.20
is the Bayes estimator corresponding to the improper prior density $\pi(\theta) = 1$. If
$X_1, \ldots, X_n$ are iid with density $f(x_1 - \theta)$ satisfying (B1)-(B3), the posterior
density after one observation $X_1 = x_1$ is $f(x_1 - \theta)$ and hence a proper density satisfying
assumption (B5), provided $E_\theta|X_1| < \infty$ (Problem 8.4). Under these assumptions,
the Pitman estimator is therefore asymptotically efficient. 8 An analogous result
holds in the scale case (Problem 8.5).


Theorem 8.3 can be generalized further. Rather than requiring the posterior
density $\tilde\pi$ to be proper with finite expectation after a fixed number $n_0$ of
observations, it is enough to assume that it satisfies these conditions for all $n \ge n_0$ when
$(X_1, \ldots, X_{n_0}) \in S_{n_0}$, where $P(S_{n_0}) \to 1$ as $n_0 \to \infty$ (Problem 8.7).

8 For a more general treatment of this result, see Stone 1974.

Example 8.6 Binomial. Let $X_i$ be independent, taking on the values 1 and 0
with probability $p$ and $q = 1 - p$, respectively, and let $\pi(p) = 1/pq$. Then, the
posterior distribution of $p$ will be proper (and will then automatically have finite
expectation) as soon as $0 < \sum X_i < n$, but not before. Since for any $0 < p < 1$
the probability of this event tends to 1 as $n \to \infty$, the asymptotic efficiency of the
Bayes estimator follows. ||
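
A short computation shows how quickly the posterior becomes proper. The sketch below assumes the posterior under $\pi(p) = 1/pq$ is proportional to $p^{x-1}q^{n-x-1}$, that is, a Beta$(x, n - x)$ density when $0 < x < n$, whose mean is $x/n$; this identification is a standard calculation and is not spelled out in the text. The function names are hypothetical.

```python
def prob_posterior_proper(n, p):
    """P(0 < sum X_i < n) for n independent Bernoulli(p) observations."""
    return 1.0 - p ** n - (1.0 - p) ** n


def bayes_estimate(successes, n):
    """Posterior mean of p under pi(p) = 1/(pq); needs 0 < successes < n."""
    if not 0 < successes < n:
        raise ValueError("posterior is improper unless 0 < successes < n")
    # The posterior is Beta(successes, n - successes), with mean successes / n.
    return successes / n


if __name__ == "__main__":
    for n in (1, 2, 5, 20, 100):
        print(n, prob_posterior_proper(n, 0.3))
    print(bayes_estimate(3, 10))
```

For $p = 0.3$ the probability of a proper posterior is 0 at $n = 1$, 0.42 at $n = 2$, and essentially 1 by $n = 100$.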

Theorem 8.3 provides additional support for the suggestion, made in Section 
4.1, that Bayes estimation constitutes a useful method for generating estimators. 
However, the theorem is unfortunately of no help in choosing among different 
Bayes estimators, since all prior distributions satisfying assumptions (B4) and (B5) 
lead to the same asymptotic behavior. In fact, if $\tilde\theta_n$ and $\tilde\theta_n'$ are Bayes estimators
corresponding to two different prior distributions $\Lambda$ and $\Lambda'$ satisfying (B4) and
(B5), (8.17) implies the even stronger statement,

(8.19) $\sqrt{n}(\tilde\theta_n - \tilde\theta_n') \xrightarrow{P} 0$.

Nevertheless, the interpretation of $\theta$ as a random variable with density $\pi(\theta)$
leads to some suggestions concerning the choice of $\pi$. Theorem 8.2 showed that
the posterior distribution of $\theta$, given the observations, eventually becomes a normal
distribution which is concentrated near the true $\theta_0$ and which is independent of $\pi$.
It is intuitively plausible that a close approximation to the asymptotic result will
tend to be achieved more quickly (i.e., for smaller n) if $\pi$ assigns a relatively
high probability to the neighborhood of $\theta_0$ than if this probability is very small. A
minimax approach thus leads to the suggestion of a uniform assignment of prior
density. It is clear what this means for a location parameter but not in general,
since the parameterization is arbitrary and reparametrization destroys uniformity.
In addition, it seems plausible that account should also be taken of the relative 
informativeness of the observations corresponding to different parameter values. 

As discussed in Section 4.1, proposals for prior distributions satisfying such 
criteria have been made (from a somewhat different point of view) by Jeffreys and 
others. For details, further suggestions, and references, see Box and Tiao 1973, 
Jaynes 1979, Berger and Bernardo 1989,1992a, 1992b, and Robert 1994a, Section 
3.4. 

When the likelihood equation has a unique root $\hat\theta_n$ (which with probability
tending to 1 is then the MLE), this estimator has a great practical advantage over 
the Bayes estimators which share its asymptotic properties. It provides a unique 
estimating procedure, applicable to a large class of problems, which is supported 
(partly because of its intuitive plausibility and partly for historical reasons) by a 
substantial proportion of the statistical profession. This advantage is less clear in 
the case of multiple roots where asymptotically efficient likelihood estimators such 
as the one-step estimator (4.11) depend on a somewhat arbitrary initial estimator 
and need no longer agree with the MLE even for large n. 

In the multiparameter case, calculation of Bayes estimators often requires the
computationally inconvenient evaluation of multiple integrals. However, this
difficulty can often be overcome through Gibbs sampling or other Markov chain
Monte Carlo algorithms; see Section 4.5.

To resolve the problem raised by the profusion of asymptotically efficient
estimators, it seems natural to carry the analysis one step further and to take into
account terms (for example, in an asymptotic expansion of the distribution of the
estimator) of order $1/n$ or $1/n^{3/2}$. Investigations along these lines have been
undertaken by Rao (1961), Peers (1965), Ghosh and Subramanyam (1974), Efron
(1975, 1982a), Pfanzagl and Wefelmeyer (1978-1979), Tibshirani (1989), Ghosh
and Mukerjee (1991, 1992, 1993), Barndorff-Nielsen and Cox (1994), and Datta
and Ghosh (1995) (see also Section 6.4). They are complicated by the fact that to
this order, the estimators tend to be biased and their efficiencies can be improved by
removing these biases. For an interesting discussion of these issues, see Berkson
(1980). The subject still requires further study.

We conclude this section by proving that the quantities $J_1$ [defined by (8.13)] and
$J_1'$ tend to zero in probability. For this purpose, it is useful to obtain the following
alternative expression for $\omega(t)$.

Lemma 8.7 The quantity $\omega(t)$, defined by (8.11), is equal to

(8.20) $\omega(t) = -I(\theta_0)\frac{t^2}{2} - \frac{1}{2n} R_n\left( T_n + \frac{t}{\sqrt{n}} \right) \left[ t + \frac{1}{I(\theta_0)\sqrt{n}} l'(\theta_0) \right]^2$

where $R_n$ is the function defined in (8.1) (Problem 8.9).


Proof for $J_1$. To prove that the integral (8.13) tends to zero in probability, divide
the range of integration into the three parts: (i) $|t| \le M$, (ii) $M < |t| < \delta\sqrt{n}$, and
(iii) $|t| \ge \delta\sqrt{n}$, and show that the integral over each of the three tends to zero in
probability.

(i) $|t| \le M$. To prove this result, we shall show that for every $0 < M < \infty$,

(8.21) $\sup \left| e^{\omega(t)} \pi\left( T_n + \frac{t}{\sqrt{n}} \right) - e^{-I(\theta_0)t^2/2} \pi(\theta_0) \right| \xrightarrow{P} 0$,

where here and throughout the proof of (i), the sup is taken over $|t| \le M$. The result
will follow from (8.21) since the range of integration is bounded. Substituting the
expression (8.20) for $\omega(t)$, (8.21) is seen to follow from the following two facts
(Problem 8.10):

(8.22) $\sup \left| \frac{1}{n} R_n\left( T_n + \frac{t}{\sqrt{n}} \right) \left[ t + \frac{1}{I(\theta_0)\sqrt{n}} l'(\theta_0) \right]^2 \right| \xrightarrow{P} 0$

and

(8.23) $\sup \left| \pi\left( T_n + \frac{t}{\sqrt{n}} \right) - \pi(\theta_0) \right| \xrightarrow{P} 0$.

The second of these is obvious from the continuity of $\pi$ and the fact that (Problem
8.11)

(8.24) $T_n \xrightarrow{P} \theta_0$.

To prove (8.22), it is enough to show that

(8.25) $\sup \left| \frac{1}{n} R_n\left( T_n + \frac{t}{\sqrt{n}} \right) \right| \xrightarrow{P} 0$

and

(8.26) $\frac{1}{I(\theta_0)} \frac{1}{\sqrt{n}} l'(\theta_0)$ is bounded in probability.

Of these, (8.26) is clear from (B1) and the central limit theorem. To see (8.25),
note that $|t| \le M$ implies

$T_n - \frac{M}{\sqrt{n}} \le T_n + \frac{t}{\sqrt{n}} \le T_n + \frac{M}{\sqrt{n}}$

and hence, by (8.24), that for any $\delta > 0$, the probability of

$\theta_0 - \delta \le T_n + \frac{t}{\sqrt{n}} \le \theta_0 + \delta$

will be arbitrarily close to 1 for sufficiently large n. The result now follows from
(B2).

(ii) $M < |t| < \delta\sqrt{n}$. For this part it is enough to prove that for $|t| \le \delta\sqrt{n}$, the
integrand of $J_1$ is bounded by an integrable function with probability $\ge 1 - \varepsilon$.
Then, the integral can be made arbitrarily small by choosing a sufficiently large M.
Since the second term of the integrand of (8.13) is integrable, it is enough to show
that such an integrable bound exists for the first term. More precisely, we shall
show that given $\varepsilon > 0$, there exists $\delta > 0$ and $C < \infty$ such that for sufficiently
large n,

(8.27) $P\left[ e^{\omega(t)} \pi\left( T_n + \frac{t}{\sqrt{n}} \right) \le C e^{-t^2 I(\theta_0)/4} \text{ for all } |t| \le \delta\sqrt{n} \right] \ge 1 - \varepsilon$.

The factor $\pi(T_n + t/\sqrt{n})$ causes no difficulty by (8.24) and the continuity of $\pi$,
so that it remains to establish such a bound for

(8.28) $\exp \omega(t) \le \exp\left\{ -\frac{t^2}{2} I(\theta_0) + \frac{1}{n} \left| R_n\left( T_n + \frac{t}{\sqrt{n}} \right) \right| \left[ t^2 + \frac{(l'(\theta_0))^2}{n I^2(\theta_0)} \right] \right\}$.

For this purpose, note that

$|t| \le \delta'\sqrt{n}$ implies $T_n - \delta' \le T_n + \frac{t}{\sqrt{n}} \le T_n + \delta'$

and hence, by (8.24), that with probability arbitrarily close to 1, for n sufficiently
large,

$|t| \le \delta'\sqrt{n}$ implies $\left| T_n + \frac{t}{\sqrt{n}} - \theta_0 \right| \le 2\delta'$.

By (B2), there exists $\delta'$ such that the latter inequality implies

$P\left[ \sup_{|t| \le \delta'\sqrt{n}} \left| \frac{1}{n} R_n\left( T_n + \frac{t}{\sqrt{n}} \right) \right| \le \frac{1}{4} I(\theta_0) \right] \ge 1 - \varepsilon$.

Combining this fact with (8.26), we see that the right side of (8.28) is $\le C' e^{-t^2 I(\theta_0)/4}$
for all t satisfying (ii), with probability arbitrarily close to 1, and this establishes
(8.27).

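(iii) $|t| \ge \delta\sqrt{n}$. As in (ii), the second term in the integrand of (8.13) can be
neglected, and it is enough to show that for all $\delta$,

(8.29) $\int_{|t| \ge \delta\sqrt{n}} \exp[\omega(t)]\,\pi\left( T_n + \frac{t}{\sqrt{n}} \right) dt$

$\quad = \sqrt{n} \int_{|\theta - T_n| \ge \delta} \pi(\theta) \exp\left\{ l(\theta) - l(\theta_0) - \frac{1}{2nI(\theta_0)} [l'(\theta_0)]^2 \right\} d\theta \xrightarrow{P} 0$.

From (8.24) and (B3), it is seen that given $\delta$, there exists $\varepsilon$ such that

$\sup_{|\theta - T_n| \ge \delta} e^{[l(\theta) - l(\theta_0)]} \le e^{-n\varepsilon}$

with probability tending to 1. By (8.26), the right side of (8.29) is therefore bounded
above by

(8.30) $C\sqrt{n}\,e^{-n\varepsilon} \int \pi(\theta)\,d\theta = C\sqrt{n}\,e^{-n\varepsilon}$

with probability tending to 1, and this completes the proof of (iii).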

To prove (8.13), let us now combine (i)-(iii). Given $\varepsilon > 0$ and $\delta > 0$, choose M
so large that

(8.31) $\int_M^\infty \left\{ C \exp\left[ -\frac{t^2}{4} I(\theta_0) \right] + \exp\left[ -\frac{t^2}{2} I(\theta_0) \right] \pi(\theta_0) \right\} dt \le \frac{\varepsilon}{3}$,

and, hence, that for sufficiently large n, the integral (8.13) over (ii) is $\le \varepsilon/3$ with
probability $\ge 1 - \varepsilon$. Next, choose n so large that the integrals (8.13) over (i) and
over (iii) are also $\le \varepsilon/3$ with probability $\ge 1 - \varepsilon$. Then, $P[J_1 \le \varepsilon] \ge 1 - 3\varepsilon$, and
this completes the proof of (8.13).

The proof for $J_1'$ requires only trivial changes. In part (i), the factor $[1 + |t|]$ is
bounded, so that the proof continues to apply. In part (ii), multiplication of the
integrand of (8.31) by $[1 + |t|]$ does not affect its integrability, and the proof goes
through as before. Finally, in part (iii), the integral in (8.30) must be replaced by
$Cn e^{-n\varepsilon} \int |\theta| \pi(\theta)\,d\theta$, which is finite by (B5).


9 Problems 
Section 1 

1.1 Let X u ..., X n be iid with E(X t ) = £. 

(a) If the X, s have a finite fourth moment, establish (1.3) 

(b) For k a positive integer, show that E(X — %) 2k ~' and E(X — § ) 2k , if they exist, are 
both 0(l/n*). 

[Hint: Without loss of generality, let f = 0 and note that E(X^ X? ■ ■ •) = 0 if any of the 
r’s is equal to 1.] 
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1.2 For fixed n, describe the relative error in Example 1.3 as a function of p. 

1.3 Prove Theorem 1.5. 

1.4 Let V, .... VbeiidasiVd;, cr 2 ),cr 2 known,andlet,g(f) = £ r ,r = 2,3,4.Determine, 
up to terms of order 1/n, 

(a) the variance of the UMVU estimator of g(f); 

(b) the bias of the MLE of g(§). 

1.5 Let $X_1, \ldots, X_n$ be iid as $N(\xi, \sigma^2)$, $\xi$ known. For even $r$, determine the variance of
the UMVU estimator (2.2.4) of $\sigma^r$ up to terms of order $1/n$.

1.6 Solve the preceding problem for the case that $\xi$ is unknown.

1.7 For estimating $p^m$ in Example 3.3.1, determine, up to order $1/n$,

(a) the variance of the UMVU estimator (2.3.2);

(b) the bias of the MLE.

1.8 Solve the preceding problem if $p^m$ is replaced by the estimand of Problem 2.3.3.

1.9 Let $X_1, \ldots, X_n$ be iid as Poisson $P(\theta)$.

(a) Determine the UMVU estimator of $P(X_i = 0) = e^{-\theta}$.

(b) Calculate the variance of the estimator of (a) up to terms of order $1/n$.

[Hint: Write the estimator in the form (1.15) where $h(\bar X)$ is the MLE of $e^{-\theta}$.]

1.10 Solve part (b) of the preceding problem for the estimator (2.3.22). 

1.11 Under the assumptions of Problem 1.1, show that $E|\bar X - \xi|^{2k-1} = O(n^{-k+1/2})$. [Hint:
Use the fact that $E|\bar X - \xi|^{2k-1} \le [E(\bar X - \xi)^{4k-2}]^{1/2}$ together with the result of Problem
1.1.]

1.12 Obtain a variant of Theorem 1.1, which requires existence and boundedness of only
$h'''$ instead of $h^{(iv)}$, but where $R_n$ is only $O(n^{-3/2})$.

[Hint: Carry the expansion (1.6) only to the second instead of the third derivative, and
apply Problem 1.11.]

1.13 To see that Theorem 1.1 is not necessarily valid without boundedness of the fourth
(or some higher) derivative, suppose that the $X$'s are distributed as $N(\xi, \sigma^2)$ and let
$h(x) = e^{x^4}$. Then, all moments of the $X$'s and all derivatives of $h$ exist.

(a) Show that the expectation of $h(\bar X)$ does not exist for any $n$, and hence that
$E\{\sqrt{n}[h(\bar X) - h(\xi)]\}^2 = \infty$ for all values of $n$.

(b) On the other hand, show that $\sqrt{n}[h(\bar X) - h(\xi)]$ has a normal limit distribution with
finite variance, and determine that variance.


1.14 Let $X_1, \ldots, X_n$ be iid from the exponential distribution with density
$(1/\theta)e^{-x/\theta}$, $x > 0$, and $\theta > 0$.

(a) Use Theorem 1.1 to find approximations to $E(\sqrt{\bar X})$ and $\mathrm{var}(\sqrt{\bar X})$.

(b) Verify the exact calculation

$\mathrm{var}(\sqrt{\bar X}) = \left[ 1 - \frac{1}{n} \left( \frac{\Gamma(n + 1/2)}{\Gamma(n)} \right)^2 \right] \theta$

and show that $\lim_{n\to\infty} n\, \mathrm{var}(\sqrt{\bar X}) = \theta/4$.

(c) Reconcile the results in parts (a) and (b). Explain why, even though Theorem 1.1
did not apply, it gave the correct answer.

(d) Show that a similar conclusion holds for $h(x) = 1/x$.

[Hint: For part (b), use the fact that $T = \sum X_i$ has a gamma distribution. The limit can
be evaluated with Stirling's formula. It can also be evaluated with a computer algebra
program.]
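
As the hint to Problem 1.14 suggests, the limit in part (b) can be checked by machine. The following is a minimal Python sketch, not part of the original problem, that evaluates the exact expression of part (b) through the log-gamma function. The function name `n_var_sqrt_mean` is an illustrative assumption.

```python
import math

def n_var_sqrt_mean(n, theta=1.0):
    """Return n * var(sqrt(X-bar)) from the exact formula in Problem 1.14(b).

    Uses var = theta * [1 - (1/n) * (Gamma(n + 1/2) / Gamma(n))**2],
    with the gamma ratio computed on the log scale to avoid overflow.
    """
    log_ratio = math.lgamma(n + 0.5) - math.lgamma(n)
    return n * theta * (1.0 - math.exp(2.0 * log_ratio) / n)

if __name__ == "__main__":
    # The printed values should settle near theta / 4 as n grows.
    for n in (1, 10, 100, 1000, 10000):
        print(n, n_var_sqrt_mean(n, theta=2.0))
```

Working on the log scale is a design choice: the two gamma values overflow for moderate $n$, whereas their log difference stays small.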

1.15 Let $X_1, \ldots, X_n$ be iid according to $U(0, \theta)$. Determine the variance of the UMVU
estimator of $\theta^k$, where $k$ is an integer, $k > -n$.

1.16 Under the assumptions of Problem 1.15, find the MLE of $\theta^k$ and compare its expected
squared error with the variance of the UMVU estimator.

1.17 Let $X_1, \ldots, X_n$ be iid according to $U(0, \theta)$, let $T = \max(X_1, \ldots, X_n)$, and let $h$ be
a function satisfying the conditions of Theorem 1.1. Show that

$E[h(T)] = h(\theta) - \frac{\theta}{n} h'(\theta) + \frac{1}{n^2} [\theta h'(\theta) + \theta^2 h''(\theta)] + O\left(\frac{1}{n^3}\right)$

and

$\mathrm{var}[h(T)] = \frac{\theta^2}{n^2} [h'(\theta)]^2 + O\left(\frac{1}{n^3}\right)$.

1.18 Apply the results of Problem 1.17 to obtain approximate answers to Problems 1.15
and 1.16, and compare the answers with the exact solutions.

1.19 If the $X$'s are as in Theorem 1.1 and if the first five derivatives of $h$ exist and the
fifth derivative is bounded, show that

$E[h(\bar X)] = h(\xi) + \frac{1}{2} h'' \frac{\sigma^2}{n} + \frac{1}{24 n^2} [4 h''' \mu_3 + 3 h^{(iv)} \sigma^4] + O(n^{-5/2})$

and if the fifth derivative of $h^2$ is also bounded

$\mathrm{var}[h(\bar X)] = (h')^2 \frac{\sigma^2}{n} + \frac{1}{n^2} \left[ h' h'' \mu_3 + \left( h' h''' + \frac{1}{2} h''^2 \right) \sigma^4 \right] + O(n^{-5/2})$

where $\mu_3 = E(X - \xi)^3$.

[Hint: Use the facts that $E(\bar X - \xi)^3 = \mu_3/n^2$ and $E(\bar X - \xi)^4 = 3\sigma^4/n^2 + O(1/n^3)$.]

1.20 Under the assumptions of the preceding problem, carry the calculation of the variance
(1.16) to terms of order $1/n^2$, and compare the result with that of the preceding
problem.

1.21 Carry the calculation of Problem 1.4 to terms of order $1/n^2$.

1.22 For the estimands of Problem 1.4, calculate the expected squared error of the MLE
to terms of order $1/n^2$, and compare it with the variance calculated in Problem 1.21.

1.23 Calculate the variance (1.18) to terms of order $1/n^2$ and compare it with the expected
squared error of the MLE carried to the same order.

1.24 Find the variance of the estimator (2.3.17) up to terms of the order $1/n^3$.

1.25 For the situation of Example 1.12, show that the UMVU estimator $\delta_{1n}$ is the
bias-corrected MLE, where the MLE is $\delta_{3n}$.

1.26 For the estimators of Example 1.13: 

(a) Calculate their exact variances. 

(b) Use the result of part (a) to verify (1.27). 

1.27 (a) Under the assumptions of Theorem 1.5, if all fourth moments of the $X_{i\nu}$ are
finite, show that $E(\bar X_i - \xi_i)(\bar X_j - \xi_j) = \sigma_{ij}/n$ and that all third and fourth moments
$E(\bar X_i - \xi_i)(\bar X_j - \xi_j)(\bar X_k - \xi_k)$, and so on are of the order $1/n^2$.

(b) If, in addition, all derivatives of $h$ of total order $\le 4$ exist and those of order 4 are
uniformly bounded, then

$E[h(\bar X_1, \ldots, \bar X_s)] = h(\xi_1, \ldots, \xi_s) + \frac{1}{2n} \sum_{i=1}^{s} \sum_{j=1}^{s} \sigma_{ij} \frac{\partial^2 h(\xi_1, \ldots, \xi_s)}{\partial \xi_i\, \partial \xi_j} + R_n$

and if the derivatives of $h^2$ of order 4 are also bounded,

$\mathrm{var}[h(\bar X_1, \ldots, \bar X_s)] = \frac{1}{n} \sum \sum \sigma_{ij} \frac{\partial h}{\partial \xi_i} \frac{\partial h}{\partial \xi_j} + R_n$

where the remainder in both cases is $O(1/n^2)$.

1.28 On the basis of a sample from $N(\xi, \sigma^2)$, let $P_n(\xi, \sigma)$ be the probability that the
UMVU estimator $\bar X^2 - \sigma^2/n$ of $\xi^2$ ($\sigma$ known) is negative.

(a) Show that $P_n(\xi, \sigma)$ is a decreasing function of $\sqrt{n}|\xi|/\sigma$.

(b) Show that $P_n(\xi, \sigma) \to 0$ as $n \to \infty$ for any fixed $\xi \ne 0$ and $\sigma$.

(c) Determine the value of $P_n(0, \sigma)$.

[Hint: $P_n(\xi, \sigma) = P[-1 - \sqrt{n}\xi/\sigma < Y < 1 - \sqrt{n}\xi/\sigma]$, where $Y = \sqrt{n}(\bar X - \xi)/\sigma$ is
distributed as $N(0, 1)$.]
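
The hint to Problem 1.28 reduces the probability of a negative estimate to a standard normal probability, which is easy to tabulate. A minimal Python sketch under that representation follows; the function names are illustrative assumptions, and the normal cdf is obtained from `math.erf`.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_negative(n, xi, sigma):
    """P[X-bar^2 - sigma^2/n < 0] = P[-1 - s < Y < 1 - s], with s = sqrt(n)*xi/sigma."""
    s = math.sqrt(n) * xi / sigma
    return std_normal_cdf(1.0 - s) - std_normal_cdf(-1.0 - s)

if __name__ == "__main__":
    # Tabulate the probability for a few sample sizes and means (sigma = 1).
    for n in (5, 25, 100):
        print(n, [round(prob_negative(n, xi, 1.0), 4) for xi in (0.0, 0.25, 0.5, 1.0)])
```

The table makes the dependence on $\sqrt{n}|\xi|/\sigma$ in parts (a) and (b) visible before attempting a proof.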

1.29 Use the $t$-distribution to find the value of $P_n(0, \sigma)$ in the preceding problem for the
UMVU estimator of $\xi^2$ when $\sigma$ is unknown for representative values of $n$.

1.30 Fill in the details of the proof of Theorem 1.9. (See also Problem 1.8.8.) 

1.31 In Example 8.13 with $\theta = 0$, show that $\delta_{2n}$ is not exactly distributed as $\sigma^2(\chi_1^2 - 1)/n$.

1.32 In Example 8.13, let $\delta_{4n} = \max(0, \bar X^2 - \sigma^2/n)$, which is an improvement over $\delta_{1n}$.

(a) Show that $\sqrt{n}(\delta_{4n} - \theta^2)$ has the same limit distribution as $\sqrt{n}(\delta_{1n} - \theta^2)$ when $\theta \ne 0$.

(b) Describe the limit distribution of $n\delta_{4n}$ when $\theta = 0$.

[Hint: Write $\delta_{4n} = \delta_{1n} + R_n$ and study the behavior of $R_n$.]

1.33 Let $X$ have the binomial distribution $b(p, n)$, and let $g(p) = pq$. The UMVU
estimator of $g(p)$ is $\delta = X(n - X)/n(n - 1)$. Determine the limit distribution of $\sqrt{n}(\delta - pq)$
and $n(\delta - pq)$ when $g'(p) \ne 0$ and $g'(p) = 0$, respectively.

[Hint: Consider first the limit behavior of $\delta' = X(n - X)/n^2$.]

1.34 Let $X_1, \ldots, X_n$ be iid as $N(\xi, 1)$. Determine the limit behavior of the distribution
of the UMVU estimator of $p = P[|X_i| < u]$.

1.35 Determine the limit behavior of the estimator (2.3.22) as $n \to \infty$.

[Hint: Consider first the distribution of $\log \delta(T)$.]

1.36 Let $X_1, \ldots, X_n$ be iid with distribution $P_\theta$, and suppose $\delta_n$ is UMVU for estimating
$g(\theta)$ on the basis of $X_1, \ldots, X_n$. If there exists $n_0$ and an unbiased estimator
$\delta_0(X_1, \ldots, X_{n_0})$ which has finite variance for all $\theta$, then $\delta_n$ is consistent for $g(\theta)$.

[Hint: For $n = k n_0$ (with $k$ an integer), compare $\delta_n$ with the estimator
$\frac{1}{k}\{\delta_0(X_1, \ldots, X_{n_0}) + \delta_0(X_{n_0+1}, \ldots, X_{2n_0}) + \cdots\}$.]

1.37 Let $Y_n$ be distributed as $N(0, 1)$ with probability $\pi_n$ and as $N(0, \tau_n^2)$ with probability
$1 - \pi_n$. If $\tau_n \to \infty$ and $\pi_n \to \pi$, determine for what values of $\pi$ the sequence $\{Y_n\}$ does
and does not have a limit distribution.

1.38 (a) In Problem 1.37, determine to what values $\mathrm{var}(Y_n)$ can tend as $n \to \infty$ if
$\pi_n \to 1$ and $\tau_n \to \infty$ but otherwise both are arbitrary.

(b) Use (a) to show that the limit of the variance need not agree with the variance of
the limit distribution.

1.39 Let $b_{m,n}$, $m, n = 1, 2, \ldots$, be a double sequence of real numbers, which for each
fixed $m$ is nondecreasing in $n$. Show that $\lim_{n\to\infty} \lim_{m\to\infty} b_{m,n} = \liminf_{m,n\to\infty} b_{m,n}$ and
$\lim_{m\to\infty} \lim_{n\to\infty} b_{m,n} = \limsup_{m,n\to\infty} b_{m,n}$ provided the indicated limits exist (they may
be infinite) and where $\liminf b_{m,n}$ and $\limsup b_{m,n}$ denote, respectively, the smallest and
the largest limit points attainable by a sequence $b_{m_k, n_k}$, $k = 1, 2, \ldots$, with $m_k \to \infty$ and
$n_k \to \infty$.

Section 2 

2.1 Let $X_1, \ldots, X_n$ be iid as $N(\theta, 1)$. Consider the two estimators

$T_n = \begin{cases} \bar X & \text{if } S_n \le a_n \\ n & \text{if } S_n > a_n \end{cases}$

where $S_n = \sum (X_i - \bar X)^2$, $P(S_n > a_n) = 1/n$, and $T_n' = (X_1 + \cdots + X_{k_n})/k_n$ with $k_n$ the
largest integer $\le \sqrt{n}$.

(a) Show that the asymptotic efficiency of $T_n'$ relative to $T_n$ is zero.

(b) Show that for any fixed $\varepsilon > 0$, $P[|T_n - \theta| > \varepsilon] = \frac{1}{n} + o\left(\frac{1}{n}\right)$, but $P[|T_n' - \theta| > \varepsilon] = o\left(\frac{1}{n}\right)$.

(c) For large values of $n$, what can you say about the two probabilities in part (b) when
$\varepsilon$ is replaced by $a/\sqrt{n}$? (Basu 1956).

2.2 If $k_n[\delta_n - g(\theta)] \xrightarrow{\mathcal{L}} H$ for some sequence $k_n$, show that the same result holds if $k_n$ is
replaced by $k_n'$, where $k_n/k_n' \to 1$.

2.3 Assume that the distribution of $Y_n = \sqrt{n}(\delta_n - g(\theta))$ converges to a distribution
with mean 0 and variance $v(\theta)$. Use Fatou's lemma (Lemma 1.2.6) to establish that
$\mathrm{var}_\theta(\delta_n) \to 0$ for all $\theta$.

2.4 If $X_1, \ldots, X_n$ are a sample from a one-parameter exponential family (1.5.2), then
$\sum T(X_i)$ is minimal sufficient and $E[(1/n)\sum T(X_i)] = (\partial/\partial\eta)A(\eta) = \tau$. Show that for
any function $g(\cdot)$ for which Theorem 1.8.12 holds, $g((1/n)\sum T(X_i))$ is asymptotically
unbiased for $g(\tau)$.

2.5 If $X_1, \ldots, X_n$ are iid $n(\mu, \sigma^2)$, show that $S^r = [1/(n - 1) \sum (x_i - \bar x)^2]^{r/2}$ is an
asymptotically unbiased estimator of $\sigma^r$.

2.6 Let $X_1, \ldots, X_n$ be iid as $U(0, \theta)$. From Example 2.1.14, $\delta_n = (n + 1)X_{(n)}/n$ is the
UMVU estimator of $\theta$, whereas the MLE is $X_{(n)}$. Determine the limit distribution of (a)
$n[\theta - \delta_n]$ and (b) $n[\theta - X_{(n)}]$. Comment on the asymptotic bias of these estimators.
[Hint: $P(X_{(n)} \le y) = y^n/\theta^n$ for any $0 < y < \theta$.]

2.7 For the situation of Problem 2.6:

(a) Calculate the mean squared errors of both $\delta_n$ and $X_{(n)}$ as estimators of $\theta$.

(b) Show

$\lim_{n\to\infty} \frac{E(X_{(n)} - \theta)^2}{E(\delta_n - \theta)^2} = 2$.
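
The moments needed in Problems 2.6 and 2.7 follow from the hint to Problem 2.6: the largest order statistic, divided by $\theta$, has density $n y^{n-1}$ on $(0, 1)$, so its $k$th moment is $n/(n + k)$. The Python sketch below is an illustration under the assumption $\theta = 1$ (no loss of generality, since both estimators scale with $\theta$); the function names are hypothetical. It evaluates both mean squared errors in exact rational arithmetic, which can be used to check a hand calculation.

```python
from fractions import Fraction

def max_moment(n, k):
    """E[X_(n)^k] for a U(0, 1) sample of size n: integral of y^k * n*y^(n-1)."""
    return Fraction(n, n + k)

def mse_mle(n):
    """E(X_(n) - 1)^2, expanded as E[X^2] - 2 E[X] + 1."""
    return max_moment(n, 2) - 2 * max_moment(n, 1) + 1

def mse_umvu(n):
    """E(c * X_(n) - 1)^2 with c = (n + 1)/n."""
    c = Fraction(n + 1, n)
    return c * c * max_moment(n, 2) - 2 * c * max_moment(n, 1) + 1

if __name__ == "__main__":
    for n in (2, 5, 50, 500):
        print(n, mse_mle(n), mse_umvu(n), float(mse_mle(n) / mse_umvu(n)))
```

Exact fractions are used instead of floats so that the ratio of the two errors can be compared with a closed form without rounding noise.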

2.8 Verify the asymptotic distribution claimed for $\delta_n$ in Example 2.5.

2.9 Let $\delta_n$ be any estimator satisfying (2.2) with $g(\theta) = \theta$. Construct a sequence $\delta_n'$ such
that $\sqrt{n}(\delta_n' - \theta) \xrightarrow{\mathcal{L}} N[0, w^2(\theta)]$ with $w(\theta) = v(\theta)$ for $\theta \ne \theta_0$ and $w(\theta_0) = 0$.

2.10 In the preceding problem, construct $\delta_n'$ such that $w(\theta) = v(\theta)$ for all $\theta \ne \theta_0$ and $\theta_1$,
and $< v(\theta)$ for $\theta = \theta_0$ and $\theta_1$.

2.11 Construct a sequence $\{\delta_n\}$ satisfying (2.2) but for which the bias $b_n(\theta)$ does not tend
to zero.

2.12 In Example 2.7 with $R_n(\theta)$ given by (2.11), show that $R_n(\theta) \to 1$ for $\theta \ne 0$ and that
$R_n(0) \to a^2$.

2.13 Let $b_n(\theta) = E_\theta(\delta_n) - \theta$ be the bias of the estimator $\delta_n$ of Example 2.5.

(a) Show that

$b_n(\theta) = \frac{-(1 - a)}{\sqrt{n}} \int_{-\sqrt[4]{n}}^{\sqrt[4]{n}} x \phi(x - \sqrt{n}\theta)\, dx$;

(b) Show that $b_n'(\theta) \to 0$ for any $\theta \ne 0$ and $b_n'(0) \to -(1 - a)$.

(c) Use (b) to explain how the Hodges estimator $\delta_n$ can violate (2.7) without violating
the information inequality.

2.14 In Example 2.7, show that if $\theta_n = c/\sqrt{n}$, then $R_n(\theta_n) \to a^2 + c^2(1 - a)^2$.
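
The limit in Problem 2.14 can be explored numerically. The sketch below rests on assumptions that are not restated in this section: that the Hodges estimator of Example 2.5 has the usual form ($\delta_n = \bar X$ if $|\bar X| \ge n^{-1/4}$ and $a\bar X$ otherwise, with $\bar X \sim N(\theta, 1/n)$), and that $R_n(\theta) = nE(\delta_n - \theta)^2$. The function names and the midpoint-rule integration are illustrative choices, not the book's method.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def hodges_scaled_risk(n, theta, a, steps=20000):
    """Approximate n * E(delta_n - theta)^2 by the midpoint rule.

    Assumes Z = sqrt(n) * X-bar ~ N(sqrt(n) * theta, 1) and
    sqrt(n) * delta_n = Z if |Z| >= n**0.25, else a * Z.
    """
    mu = math.sqrt(n) * theta
    cut = n ** 0.25
    lo, hi = mu - 12.0, mu + 12.0      # 12 standard deviations on each side
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        est = z if abs(z) >= cut else a * z
        total += (est - mu) ** 2 * std_normal_pdf(z - mu) * h
    return total

if __name__ == "__main__":
    a, c = 0.5, 1.5
    for n in (10 ** 2, 10 ** 4, 10 ** 8):
        print(n, hodges_scaled_risk(n, c / math.sqrt(n), a))
```

Comparing the printed values with the risk at a fixed $\theta \ne 0$ shows how the local sequence $\theta_n = c/\sqrt{n}$ behaves differently from any fixed parameter value.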


Section 3 

3.1 Let $X$ have the binomial distribution $b(p, n)$, $0 \le p \le 1$. Determine the MLE of $p$

(a) by the usual calculus method determining the maximum of a function;

(b) by showing that $p^x q^{n-x} \le (x/n)^x [(n - x)/n]^{n-x}$.

[Hint: (b) Apply the fact that the geometric mean is equal to or less than the arithmetic
mean to $n$ numbers of which $x$ are equal to $np/x$ and $n - x$ equal to $nq/(n - x)$.]

3.2 In the preceding problem, show that the MLE does not exist when $p$ is restricted to
$0 < p < 1$ and when $x = 0$ or $= n$.

3.3 Let $X_1, \ldots, X_n$ be iid according to $N(\xi, \sigma^2)$. Determine the MLE of (a) $\xi$ when $\sigma$ is
known, (b) $\sigma$ when $\xi$ is known, and (c) $(\xi, \sigma)$ when both are unknown.

3.4 Suppose $X_1, \ldots, X_n$ are iid as $N(\xi, 1)$ with $\xi > 0$. Show that the MLE is $\bar X$ when
$\bar X > 0$ and does not exist when $\bar X \le 0$.

3.5 Let $X$ take on the values 0 and 1 with probabilities $p$ and $q$, respectively. When it is
known that $1/3 \le p \le 2/3$, (a) find the MLE and (b) show that the expected squared
error of the MLE is uniformly larger than that of $\delta(x) = 1/2$.

[A similar estimation problem arises in randomized response surveys. See Example
5.2.2.]

3.6 When $\Omega$ is finite, show that the MLE is consistent if and only if it satisfies (3.2).

3.7 Show that Theorem 3.2 remains valid if assumption A1 is relaxed to A1': There is a
nonempty set $\Omega_0 \subset \Omega$ such that $\theta_0 \in \Omega_0$ and $\Omega_0$ is contained in the support of each $P_\theta$.

3.8 Prove the existence of unique $0 < a_k < a_{k-1}$, $k = 1, 2, \ldots$, satisfying (3.4).

3.9 Prove (3.9). 

3.10 In Example 3.6 with $0 < c < 1/2$, determine a consistent estimator of $k$.

[Hint: (a) The smallest value $K$ of $j$ for which $I_j$ contains at least as many of the $X$'s
as any other $I$ is consistent. (b) The value of $j$ for which $I_j$ contains the median of the
$X$'s is consistent since the median of $f_k$ is in $I_k$.]

3.11 Verify the nature of the roots in Example 3.9.

3.12 Let $X$ be distributed as $N(\theta, 1)$. Show that conditionally given $a < X < b$, the
variable $X$ tends in probability to $b$ as $\theta \to \infty$.

3.13 Consider a sample $X_1, \ldots, X_n$ from a Poisson distribution conditioned to be positive,
so that $P(X_i = x) = \theta^x e^{-\theta}/[x!(1 - e^{-\theta})]$ for $x = 1, 2, \ldots$. Show that the likelihood
equation has a unique root for all values of $x$.
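
For the positive Poisson model of Problem 3.13, differentiating the log likelihood gives the likelihood equation $\bar x = \theta/(1 - e^{-\theta})$. This form is a short derivation from the stated probability function and is not quoted from the text. The Python sketch below solves it by bisection; the function name is hypothetical, and the degenerate sample with $\bar x = 1$ (all observations equal to 1) is excluded because the left side must exceed 1 for a positive root.

```python
import math

def truncated_poisson_root(xbar, iterations=200):
    """Solve theta / (1 - exp(-theta)) = xbar for theta > 0 by bisection.

    The left side increases from 1 (theta -> 0) to infinity, so a root
    exists and is unique whenever xbar > 1.
    """
    if xbar <= 1.0:
        raise ValueError("no positive root unless the sample mean exceeds 1")

    def g(theta):
        return theta / (-math.expm1(-theta)) - xbar

    lo, hi = 1e-12, xbar + 1.0         # g(lo) < 0 < g(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    sample = [1, 1, 2, 3, 1, 4, 2]
    print(truncated_poisson_root(sum(sample) / len(sample)))
```

Bisection is chosen over Newton's method here because the monotonicity that guarantees uniqueness also guarantees that the bracket always contains the root.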

3.14 Let X have the negative binomial distribution (2.3.3). Find an ELE of p. 

3.15 (a) A density function is strongly unimodal, or equivalently log concave, if $\log f(x)$
is a concave function. Show that such a density function has a unique mode.

(b) Let $X_1, \ldots, X_n$ be iid with density $f(x - \theta)$. Show that the likelihood function has
a unique root if $f'(x)/f(x)$ is monotone, and the root is a maximum if $f'(x)/f(x)$
is decreasing. Hence, densities that are log concave yield unique MLEs.

(c) Let $X_1, \ldots, X_n$ be positive random variables (or symmetrically distributed about
zero) with joint density $a^n \prod f(a x_i)$, $a > 0$. Show that the likelihood equation has
a unique maximum if $x f'(x)/f(x)$ is strictly decreasing for $x > 0$.

(d) If $X_1, \ldots, X_n$ are iid with density $f(x_i - \theta)$ where $f$ is unimodal and if the likelihood
equation has a unique root, show that the likelihood equation also has a unique root
when the density of each $X_i$ is $a f[a(x_i - \theta)]$, with $a$ known.

3.16 For each of the following densities, $f(\cdot)$, determine if (a) it is strongly unimodal
and (b) $x f'(x)/f(x)$ is strictly decreasing for $x > 0$. Hence, comment on whether the
respective location and scale parameters have unique MLEs:

(a) $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $-\infty < x < \infty$ (normal)

(b) $f(x) = \frac{1}{\sqrt{2\pi}\, x} e^{-(\log x)^2/2}$, $0 < x < \infty$ (lognormal)

(c) $f(x) = e^{-x}/(1 + e^{-x})^2$, $-\infty < x < \infty$ (logistic)

(d) $f(x) = \frac{\Gamma((\nu + 1)/2)}{\Gamma(\nu/2)} \frac{1}{\sqrt{\nu\pi}} \frac{1}{(1 + x^2/\nu)^{(\nu+1)/2}}$ ($t$ with $\nu$ df)

3.17 If $X_1, \ldots, X_n$ are iid with density $f(x_i - \theta)$ or $a f(a x_i)$ and $f$ is the logistic density
$L(0, 1)$, the likelihood equation has unique solutions $\hat\theta$ and $\hat a$ both in the location and
the scale case. Determine the limit distribution of $\sqrt{n}(\hat\theta - \theta)$ and $\sqrt{n}(\hat a - a)$.

3.18 In Problem 3.15(b), with $f$ the Cauchy density $C(0, a)$, the likelihood equation has
a unique root $\hat a$ and $\sqrt{n}(\hat a - a) \xrightarrow{\mathcal{L}} N(0, 2a^2)$.

3.19 If $X_1, \ldots, X_n$ are iid as $C(\theta, 1)$, then for any fixed $n$ there is positive probability (a)
that the likelihood equation has $2n - 1$ roots and (b) that the likelihood equation has a
unique root.

[Hint: (a) If the $x$'s are sufficiently widely separated, the value of $L'(\theta)$ in the neighborhood
of $x_i$ is dominated by the term $(x_i - \theta)/[1 + (x_i - \theta)^2]$. As $\theta$ passes through $x_i$,
this term changes signs so that the log likelihood has a local maximum near $x_i$. (b) Let
the $x$'s be close together.]
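
The two situations described in the hint to Problem 3.19 can be illustrated numerically. The Python sketch below counts sign changes of the Cauchy score function on a fine grid; the function names, the grid step, and the two data sets are illustrative assumptions. Clearing denominators turns the likelihood equation into a polynomial of degree $2n - 1$, so the count can never exceed $2n - 1$.

```python
def cauchy_score(theta, xs):
    """Derivative of the C(theta, 1) log likelihood at theta."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def count_sign_changes(xs, lo, hi, step=0.01):
    """Count sign changes of the score on an evenly spaced grid over [lo, hi]."""
    changes, last = 0, 0
    steps = int(round((hi - lo) / step))
    for i in range(steps + 1):
        value = cauchy_score(lo + i * step, xs)
        sign = (value > 0) - (value < 0)
        if sign != 0:
            if last != 0 and sign != last:
                changes += 1
            last = sign
    return changes

if __name__ == "__main__":
    print(count_sign_changes([0.0, 100.0, 250.0], -50.0, 300.0))  # widely separated
    print(count_sign_changes([0.0, 0.1, 0.3], -50.0, 50.0))       # close together
```

A grid search only detects roots that are separated by more than the step, so it is a diagnostic for well-separated configurations such as these, not a general root counter.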

3.20 If $X_1, \ldots, X_n$ are iid according to the gamma distribution $\Gamma(\theta, 1)$, the likelihood
equation has a unique root.

[Hint: Use Example 3.12. Alternatively, write down the likelihood and use the fact that
$\Gamma'(\theta)/\Gamma(\theta)$ is an increasing function of $\theta$.]

3.21 Let $X_1, \ldots, X_n$ be iid according to a Weibull distribution with density

$f_\theta(x) = \theta x^{\theta - 1} e^{-x^\theta}$, $x > 0$, $\theta > 0$,

which is not a member of the exponential, location, or scale family. Nevertheless, show
that there is a unique interior maximum of the likelihood function.

3.22 Under the assumptions of Theorem 3.2, show that

$\left[ L\left(\theta_0 + \frac{1}{\sqrt{n}}\right) - L(\theta_0) + \frac{1}{2} I(\theta_0) \right] \Big/ \sqrt{I(\theta_0)}$

tends in law to $N(0, 1)$.

3.23 Let $X_1, \ldots, X_n$ be iid according to $N(\theta, a\theta^2)$, $\theta > 0$, where $a$ is a known positive
constant.

(a) Find an explicit expression for an ELE of $\theta$.

(b) Determine whether there exists an MRE estimator under a suitable group of transformations.

[This case was considered by Berk (1972).]

3.24 Check that the assumptions of Theorem 3.10 are satisfied in Example 3.12. 

3.25 For $X_1, \ldots, X_n$ iid as $DE(\theta, 1)$, show that (a) the sample median is an MLE of $\theta$
and (b) the sample median is asymptotically normal with variance $1/n$, the information
inequality bound.

3.26 In Example 3.12, show directly that $(1/n)\sum T(X_i)$ is an asymptotically efficient
estimator of $\theta = E_\eta[T(X)]$ by considering its limit distribution.

3.27 Let $X_1, \ldots, X_n$ be iid according to $\theta g(x) + (1 - \theta)h(x)$, where $(g, h)$ is a pair of
specified probability densities with respect to $\mu$, and where $0 < \theta < 1$.

(a) Give one example of $(g, h)$ for which the assumptions of Theorem 3.10 are satisfied
and one for which they are not.

(b) Discuss the existence and nature of the roots of the likelihood equation for $n = 1$,
2, 3.

3.28 Under the assumptions of Theorem 3.7, suppose that $\hat\theta_{1n}$ and $\hat\theta_{2n}$ are two consistent
sequences of roots of the likelihood equation. Prove that $P_{\theta_0}(\hat\theta_{1n} = \hat\theta_{2n}) \to 1$ as $n \to \infty$.
[Hint:

(a) Let $S_n = \{x : x = (x_1, \ldots, x_n)$ such that $\hat\theta_{1n}(x) \ne \hat\theta_{2n}(x)\}$. For all $x \in S_n$, there exists
$\theta_n^*$ between $\hat\theta_{1n}$ and $\hat\theta_{2n}$ such that $L''(\theta_n^*) = 0$. For all $x \notin S_n$, let $\theta_n^*$ be the common
value of $\hat\theta_{1n}$ and $\hat\theta_{2n}$. Then, $\theta_n^*$ is a consistent sequence of roots of the likelihood
equation.

(b) $(1/n)L''(\theta_n^*) - (1/n)L''(\theta_0) \to 0$ in probability and therefore $(1/n)L''(\theta_n^*) \to -I(\theta_0)$
in probability.

(c) Let $0 < \varepsilon < I(\theta_0)$ and let

$S_n' = \left\{ x : \frac{1}{n} L''(\theta_n^*) < -I(\theta_0) + \varepsilon \right\}$.

Then, $P_{\theta_0}(S_n') \to 1$. On the other hand, $L''(\theta_n^*) = 0$ on $S_n$ so that $S_n$ is contained in
the complement of $S_n'$ (Huzurbazar 1948).]
3.29 To establish the measurability of the sequence of roots $\theta_n^*$ of Theorem 3.7, we can
follow the proof of Serfling (1980, Section 4.2.2) where the measurability of a similar
sequence is proved.

(a) For definiteness, define $\theta_n(a)$ as the value that minimizes $|\theta - \theta_0|$ subject to

$\theta_0 - a \le \theta \le \theta_0 + a$ and $\frac{\partial}{\partial\theta} l(\theta|x) = 0$.

Show that $\theta_n(a)$ is measurable.

(b) Show that 0*, the root closest to 9*, is measurable. 

[Hint: For part (a), write the set $\{\theta_n(a) > t\}$ as countable unions and intersections of
measurable sets, using the fact that $(\partial/\partial\theta)\, l(\theta|x)$ is continuous, and hence measurable.]


Section 4 
4.1 Let

$u(t) = \begin{cases} c \int_0^t e^{-1/x(1-x)}\, dx & \text{for } 0 < t < 1 \\ 0 & \text{for } t \le 0 \\ 1 & \text{for } t \ge 1. \end{cases}$

Show that for a suitable $c$, the function $u$ is continuous and infinitely differentiable for
$-\infty < t < \infty$.

4.2 Show that the density (4.1) with $\Omega = (0, \infty)$ satisfies all conditions of Theorem 3.10
with the exception of (d) of Theorem 2.6.

4.3 Show that the density (4.4) with $\Omega = (0, \infty)$ satisfies all conditions of Theorem 3.10.

4.4 In Example 4.5, evaluate the estimators (4.8) and (4.14) for the Cauchy case, using
for $\tilde\theta_n$ the sample median.

4.5 In Example 4.7, show that $l(\theta)$ is concave.

4.6 In Example 4.7, if $\eta = \xi$, show how to obtain a $\sqrt{n}$-consistent estimator by equating
sample and population second moments.

4.7 In Theorem 4.8, show that $\sigma_{11} = \sigma_{12}$.

4.8 Without using Theorem 4.8, in Example 4.13 show that the EM sequence converges 
to the MLE. 

4.9 Consider the following 12 observations from a bivariate normal distribution with
parameters $\mu_1 = \mu_2 = 0$, $\sigma_1^2$, $\sigma_2^2$, $\rho$:

$x_1$    1   1  -1  -1   2   2  -2  -2   *   *   *   *
$x_2$    1  -1   1  -1   *   *   *   *   2   2  -2  -2

where * represents a missing value.

(a) Show that the likelihood function has global maxima at $\rho = \pm 1/2$, $\sigma_1^2 = \sigma_2^2 = 8/3$,
and a saddlepoint at $\rho = 0$, $\sigma_1^2 = \sigma_2^2 = 5/2$.

(b) Show that if an EM sequence starts with $\rho = 0$, then it remains with $\rho = 0$ for all
subsequent iterations.

(c) Show that if an EM sequence starts with $\rho$ bounded away from zero, it will converge
to a maximum.

[This problem is due to Murray (1977), and is discussed by Wu (1983).]

4.10 Show that if the EM complete-data density $f(y, z|\theta)$ of (4.21) is in a curved exponential
family, then the hypotheses of Theorem 4.12 are satisfied.
4.11 In the EM algorithm, calculation of the E-step, the expectation calculation, can
be complicated. In such cases, it may be possible to replace the E-step by a Monte
Carlo evaluation, creating the MCEM algorithm (Wei and Tanner 1990). Consider the
following MCEM evaluation of $Q(\theta|\hat\theta_{(j)}, y)$:

Given $\hat\theta_{(j)}^{(k)}$,

(1) Generate $Z_1, \ldots, Z_k$, iid, from $k(z|\hat\theta_{(j)}^{(k)}, y)$,

(2) Let $\hat Q(\theta|\hat\theta_{(j)}^{(k)}, y) = \frac{1}{k} \sum_{i=1}^{k} \log L(\theta|y, z_i)$,

and then calculate $\hat\theta_{(j+1)}^{(k)}$ as the value that maximizes $\hat Q(\theta|\hat\theta_{(j)}^{(k)}, y)$.

(a) Show that $\hat Q(\theta|\hat\theta_{(j)}^{(k)}, y) \to Q(\theta|\hat\theta_{(j)}, y)$ as $k \to \infty$.

(b) What conditions will ensure that $L(\hat\theta_{(j+1)}^{(k)}|y) \ge L(\hat\theta_{(j)}^{(k)}|y)$ for sufficiently large $k$?
Are the hypotheses of Theorem 4.12 sufficient?

4.12 For the mixture distribution of Example 4.7, that is,

$X_i \sim \theta g(x) + (1 - \theta)h(x)$, $i = 1, \ldots, n$, independent

where $g(\cdot)$ and $h(\cdot)$ are known, an EM algorithm can be used to find the ML estimator
of $\theta$. Let $Z_1, \ldots, Z_n$, where $Z_i$ indicates from which distribution $X_i$ has been drawn, so

$X_i | Z_i = 1 \sim g(x)$
$X_i | Z_i = 0 \sim h(x)$.

(a) Show that the complete-data likelihood can be written

$L(\theta|x, z) = \prod_{i=1}^{n} [z_i g(x_i) + (1 - z_i)h(x_i)]\, \theta^{z_i} (1 - \theta)^{1 - z_i}$.

(b) Show that $E(Z_i|\theta, x_i) = \theta g(x_i)/[\theta g(x_i) + (1 - \theta)h(x_i)]$ and hence that the EM
sequence is given by

$\hat\theta_{(j+1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\hat\theta_{(j)} g(x_i)}{\hat\theta_{(j)} g(x_i) + (1 - \hat\theta_{(j)})h(x_i)}$.

(c) Show that $\hat\theta_{(j)} \to \hat\theta$, the ML estimator of $\theta$.
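
The EM recursion of Problem 4.12(b) is short enough to run directly. The following Python sketch iterates it for two known component densities; the choice of unit-variance normal components, the data, and all function names are illustrative assumptions and not part of the problem.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1), used here as a stand-in for the known g and h."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def em_mixing_weight(xs, g, h, theta0=0.5, iters=200):
    """Iterate theta <- (1/n) * sum_i theta*g(x_i) / [theta*g(x_i) + (1-theta)*h(x_i)]."""
    theta = theta0
    path = [theta]
    for _ in range(iters):
        # E-step: posterior probability that each x_i came from g.
        resp = [theta * g(x) / (theta * g(x) + (1.0 - theta) * h(x)) for x in xs]
        # M-step: the new weight is the average responsibility.
        theta = sum(resp) / len(xs)
        path.append(theta)
    return path

def log_lik(theta, xs, g, h):
    """Incomplete-data log likelihood of the mixing weight."""
    return sum(math.log(theta * g(x) + (1.0 - theta) * h(x)) for x in xs)

if __name__ == "__main__":
    data = [-0.4, 0.1, 0.3, 1.2, 2.6, 2.9, 3.1, 3.8]
    g = lambda x: normal_pdf(x, 0.0)
    h = lambda x: normal_pdf(x, 3.0)
    path = em_mixing_weight(data, g, h, theta0=0.2)
    print(path[-1], log_lik(path[-1], data, g, h))
```

Recording the whole path, not just the final value, makes it possible to check that the log likelihood never decreases from one iteration to the next, which is the property part (c) relies on.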

4.13 For the situation of Example 4.10: 

(a) Show that the M-step of the EM algorithm is given by 


M ; 


EE yu +zi +Z 2 ) /12, 

i=l j =1 


% = y ‘i + Z> ) /3 - A- ' = 1.3 

E' v 9'j / 3 - A. *'=2,4. 

(b) Show that the E-step of the EM algorithm is given by 

z,- = E [TbI/t = /t, o',- = a,-] = jX + oii i = l,3. 
(c) Under the restriction $\sum_i \alpha_i = 0$, show that the EM sequence converges to
$\hat\alpha_i = \bar y_{i\cdot} - \hat\mu$, where $\hat\mu = \sum_i \bar y_{i\cdot}/4$.

(d) Under the restriction $\sum_i n_i \alpha_i = 0$, show that the EM sequence converges to
$\hat\alpha_i = \bar y_{i\cdot} - \hat\mu$, where $\hat\mu = \sum_{ij} y_{ij}/10$.

(e) For a general one-way layout with $a$ treatments and $n_i$ observations per treatment,
show how to use the EM algorithm to augment the data so that each treatment has
$n$ observations. Write down the EM sequence, and show what it converges to under
the restrictions of parts (c) and (d).

[The restrictions of parts (c) and (d) were encountered in Example 3.4.9, where they led, 
respectively, to an unweighted means analysis and a weighted means analysis.] 

4.14 In the two-way layout (see Example 3.4.11), the EM algorithm can be very helpful 
in computing ML estimators in the unbalanced case. Suppose that we observe 

$Y_{ijk} \sim N(\xi_{ij}, \sigma^2)$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, $k = 1, \ldots, n_{ij}$,

where $\xi_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$. The data will be augmented so that the complete data have
$n$ observations per cell.

(a) Show how to compute both the E-step and the M-step of the EM algorithm.

(b) Under the restriction $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$, show that the EM
sequence converges to the ML estimators corresponding to an unweighted means
analysis.

(c) Under the restriction $\sum_i n_{i\cdot}\alpha_i = \sum_j n_{\cdot j}\beta_j = \sum_i n_{i\cdot}\gamma_{ij} = \sum_j n_{\cdot j}\gamma_{ij} = 0$, show that
the EM sequence converges to the ML estimators corresponding to a weighted
means analysis.

4.15 For the one-way layout with random effects (Example 3.5.1), the EM algorithm is 
useful for computing ML estimates. (In fact, it is very useful in many mixed models; 
see Searle et al. 1992, Chapter 8.) Suppose we have the model 

$X_{ij} = \mu + A_i + U_{ij}$  ($j = 1, \ldots, n_i$, $i = 1, \ldots, s$)

where $A_i$ and $U_{ij}$ are independent normal random variables with mean zero and known
variance. To compute the ML estimates of $\mu$, $\sigma_A^2$, and $\sigma_U^2$ it is typical to employ an EM
algorithm using the unobservable $A_i$'s as the augmented data. Write out both the E-step
and the M-step, and show that the EM sequence converges to the ML estimators.

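4.16 Maximum likelihood estimation in the probit model of Section 3.6 can be implemented
using the EM algorithm. We observe independent Bernoulli variables $X_1, \ldots, X_n$,
which depend on unobservable variables $Z_i$ distributed independently as $N(\xi, \sigma^2)$,
where

$X_i = \begin{cases} 0 & \text{if } Z_i \le u \\ 1 & \text{if } Z_i > u. \end{cases}$

Assuming that $u$ is known, we are interested in obtaining ML estimates of $\xi$ and $\sigma^2$.

(a) Show that the likelihood function is $p^{\sum x_i}(1 - p)^{n - \sum x_i}$, where

$p = P(Z_i > u) = \Phi\left(\frac{\xi - u}{\sigma}\right)$.

(b) If we consider $Z_1, \ldots, Z_n$ to be the complete data, the complete-data likelihood is

$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(z_i - \xi)^2/2\sigma^2}$

and the expected complete-data log likelihood is

$-\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} [E(Z_i^2|X_i) - 2\xi E(Z_i|X_i) + \xi^2]$.

(c) Show that the EM sequence is given by

$\hat\xi_{(j+1)} = \frac{1}{n} \sum_{i=1}^{n} t_i(\hat\xi_{(j)}, \hat\sigma^2_{(j)})$,

$\hat\sigma^2_{(j+1)} = \frac{1}{n} \left[ \sum_{i=1}^{n} v_i(\hat\xi_{(j)}, \hat\sigma^2_{(j)}) - \frac{1}{n} \left( \sum_{i=1}^{n} t_i(\hat\xi_{(j)}, \hat\sigma^2_{(j)}) \right)^2 \right]$,

where

$t_i(\xi, \sigma^2) = E(Z_i|X_i, \xi, \sigma^2)$ and $v_i(\xi, \sigma^2) = E(Z_i^2|X_i, \xi, \sigma^2)$.

(d) Show that

$E(Z_i|X_i, \xi, \sigma^2) = \xi + \sigma H_i\left(\frac{u - \xi}{\sigma}\right)$,

$E(Z_i^2|X_i, \xi, \sigma^2) = \xi^2 + \sigma^2 + \sigma(u + \xi) H_i\left(\frac{u - \xi}{\sigma}\right)$,

where

$H_i(t) = \begin{cases} \frac{\phi(t)}{1 - \Phi(t)} & \text{if } X_i = 1 \\ -\frac{\phi(t)}{\Phi(t)} & \text{if } X_i = 0. \end{cases}$

(e) Show that $\hat\xi_{(j)} \to \hat\xi$ and $\hat\sigma^2_{(j)} \to \hat\sigma^2$, the ML estimates of $\xi$ and $\sigma^2$.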

4.17 Verify (4.30). 

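4.18 The EM algorithm can also be implemented in a Bayesian hierarchical model to
find a posterior mode. Recall the model (4.5.5.1),

$X|\theta \sim f(x|\theta)$,
$\Theta|\lambda \sim \pi(\theta|\lambda)$,
$\Lambda \sim \gamma(\lambda)$,

where interest would be in estimating quantities from $\pi(\theta|x)$. Since

$\pi(\theta|x) = \int \pi(\theta, \lambda|x)\, d\lambda$,

where $\pi(\theta, \lambda|x) = \pi(\theta|\lambda, x)\pi(\lambda|x)$, the EM algorithm is a candidate method for finding
the mode of $\pi(\theta|x)$, where $\lambda$ would be used as the augmented data.

(a) Define $k(\lambda|\theta, x) = \pi(\theta, \lambda|x)/\pi(\theta|x)$, and show that

$\log \pi(\theta|x) = \int \log \pi(\theta, \lambda|x)\, k(\lambda|\theta^*, x)\, d\lambda - \int \log k(\lambda|\theta, x)\, k(\lambda|\theta^*, x)\, d\lambda$.

(b) If the sequence $\{\hat\theta_{(j)}\}$ satisfies

$\max_\theta \int \log \pi(\theta, \lambda|x)\, k(\lambda|\hat\theta_{(j)}, x)\, d\lambda = \int \log \pi(\hat\theta_{(j+1)}, \lambda|x)\, k(\lambda|\hat\theta_{(j)}, x)\, d\lambda$,

show that $\log \pi(\hat\theta_{(j+1)}|x) \ge \log \pi(\hat\theta_{(j)}|x)$. Under what conditions will the sequence
$\{\hat\theta_{(j)}\}$ converge to the mode of $\pi(\theta|x)$?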





(c) For the hierarchy

$X|\theta \sim N(\theta, 1)$,
$\Theta|\lambda \sim N(\lambda, 1)$,
$\Lambda \sim \mathrm{Uniform}(-\infty, \infty)$,

show how to use the EM algorithm to calculate the posterior mode of $\pi(\theta|x)$.


4.19 There is a connection between the EM algorithm and Gibbs sampling, in that both 
have their basis in Markov chain theory. One way of seeing this is to show that the 
incomplete-data likelihood is a solution to the integral equation of successive substitu¬ 
tion sampling (see Problems 4.5.9-4.5.11), and that Gibbs sampling can then be used 
to calculate the likelihood function. If $L(\theta|y)$ is the incomplete-data likelihood and
$L(\theta|y, z)$ is the complete-data likelihood, define

$L^*(\theta|y) = \frac{L(\theta|y)}{\int L(\theta|y)\, d\theta}$, $L^*(\theta|y, z) = \frac{L(\theta|y, z)}{\int L(\theta|y, z)\, d\theta}$.

(a) Show that $L^*(\theta|y)$ is the solution to

$L^*(\theta|y) = \int \left[ \int L^*(\theta|y, z)\, k(z|\theta', y)\, dz \right] L^*(\theta'|y)\, d\theta'$

where, as usual, $k(z|\theta, y) = L(\theta|y, z)/L(\theta|y)$.

(b) Show how the sequence $\theta_{(j)}$ from the Gibbs iteration,

$\theta_{(j)} \sim L^*(\theta|y, z_{(j-1)})$,
$z_{(j)} \sim k(z|\theta_{(j)}, y)$,

will converge to a random variable with density $L^*(\theta|y)$ as $j \to \infty$. How can this
be used to compute the likelihood function $L(\theta|y)$?

[Using the functions $L(\theta|y, z)$ and $k(z|\theta, y)$, the EM algorithm will get us the ML
estimator from $L(\theta|y)$, whereas the Gibbs sampler will get us the entire function. This
likelihood implementation of the Gibbs sampler was used by Casella and Berger (1994) 
and is also described by Smith and Roberts (1993). A version of the EM algorithm, 
where the Markov chain connection is quite apparent, was given by Baum and Petrie 
(1966) and Baum et al. (1970).] 


Section 5 

5.1 (a) If a vector $Y_n$ in $E_s$ converges in probability to a constant vector $a$, and if $h$ is a
continuous function defined over $E_s$, show that $h(Y_n) \to h(a)$ in probability.

(b) Use (a) to show that the elements of $||A_{jkn}||^{-1}$ tend in probability to the elements
of $B$ as claimed in the proof of Lemma 5.2.

[Hint: (a) Apply Theorem 1.8.19 and Problem 1.8.13.]

5.2 (a) Show that (5.26) with the remainder term neglected has the same form as (5.15)
and identify the $A_{jkn}$.

(b) Show that the resulting $a_{jk}$ of Lemma 5.2 are the same as those of (5.23).

(c) Show that the remainder term in (5.26) can be neglected in the proof of Theorem 
5.3. 
5.3 Let $X_1, \ldots, X_n$ be iid according to $N(\xi, \sigma^2)$.

(a) Show that the likelihood equations have a unique root.

(b) Show directly (i.e., without recourse to Theorem 5.1) that the MLEs $\hat\xi$ and $\hat\sigma$ are
asymptotically efficient.

5.4 Let $(X_0, \ldots, X_s)$ have the multinomial distribution $M(p_0, \ldots, p_s; n)$.

(a) Show that the likelihood equations have a unique root.

(b) Show directly that the MLEs $\hat p_i$ are asymptotically efficient.

5.5 Prove Corollary 5.4.

5.6 Show that there exists a function $f$ of two variables for which the equations $\partial f(x, y)/\partial x = 0$
and $\partial f(x, y)/\partial y = 0$ have a unique solution, and this solution is a local but not a global
maximum of $f$.


Section 6 

6.1 In Example 6.1, show that the likelihood equations are given by (6.2) and (6.3). 

6.2 In Example 6.1, verify Equation (6.4). 

6.3 Verify (6.5). 

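6.4 If $\theta = (\theta_1, \ldots, \theta_r, \theta_{r+1}, \ldots, \theta_s)$ and if

$\mathrm{cov}\left[ \frac{\partial}{\partial\theta_i} L(\theta), \frac{\partial}{\partial\theta_j} L(\theta) \right] = 0$ for any $i \le r < j$,

then the asymptotic distribution of $(\hat\theta_1, \ldots, \hat\theta_r)$ under the assumptions of Theorem 5.1
is unaffected by whether or not $\theta_{r+1}, \ldots, \theta_s$ are known.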

6.5 Let $X_1, \ldots, X_n$ be iid from a $\Gamma(\alpha, \beta)$ distribution with density $1/(\Gamma(\alpha)\beta^\alpha) \times x^{\alpha-1} e^{-x/\beta}$.

(a) Calculate the information matrix for the usual $(\alpha, \beta)$ parameterization.

(b) Write the density in terms of the parameters $(\alpha, \mu) = (\alpha, \alpha/\beta)$. Calculate the
information matrix for the $(\alpha, \mu)$ parameterization and show that it is diagonal,
and, hence, the parameters are orthogonal.

(c) If the MLE's in part (a) are $(\hat\alpha, \hat\beta)$, show that $(\hat\alpha, \hat\mu) = (\hat\alpha, \hat\alpha/\hat\beta)$. Thus, either model
estimates the mean equally well.

(For the theory behind, and other examples of, parameter orthogonality, see Cox and
Reid 1987.)

6.6 In Example 6.4, verify the MLEs $\hat\xi_i$ and $\hat\sigma_{jk}$ when the $\xi$'s are unknown.

6.7 In Example 6.4, show that the $S_{jk}$ given by (6.15) are independent of $(\bar X_1, \ldots, \bar X_p)$
and have the same joint distribution as the statistics (6.13) with $n$ replaced by $n - 1$.
[Hint: Subject each of the $p$ vectors $(X_{i1}, \ldots, X_{in})$ to the same orthogonal transformation,
where the first row of the orthogonal matrix is $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$.]

6.8 Verify the matrices (a) (6.17) and (b) (6.18). 

6.9 Consider the situation leading to (6.20), where $(X_i, Y_i)$, $i = 1, \ldots, n$, are iid according
to a bivariate normal distribution with $E(X_i) = E(Y_i) = 0$, $\mathrm{var}(X_i) = \mathrm{var}(Y_i) = 1$, and
unknown correlation coefficient $\rho$.

(a) Show that the likelihood equation is a cubic for which the probability of a unique
root tends to 1 as $n \to \infty$. [Hint: For a cubic equation $ax^3 + 3bx^2 + 3cx + d = 0$,
let $G = a^2 d - 3abc + 2b^3$ and $H = ac - b^2$. Then the condition for a unique real
root is $G^2 + 4H^3 > 0$.]

(b) Show that if $\hat\rho_n$ is a consistent solution of the likelihood equation, then it satisfies
(6.20).

(c) Show that $\delta = \sum X_i Y_i/n$ is a consistent estimator of $\rho$ and that
$\sqrt{n}(\delta - \rho) \xrightarrow{\mathcal{L}} N(0, 1 + \rho^2)$ and, hence, that $\delta$ is less efficient than the MLE of $\rho$.
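
The root-counting criterion in the hint to Problem 6.9(a) is easy to apply mechanically. The Python sketch below evaluates $G^2 + 4H^3$ for a cubic written in the hint's form $ax^3 + 3bx^2 + 3cx + d = 0$. The function name is an illustrative assumption, and exact rational arithmetic is used so that the sign test is not affected by rounding.

```python
from fractions import Fraction

def has_unique_real_root(a, b, c, d):
    """Apply the criterion G^2 + 4 H^3 > 0 to a*x^3 + 3*b*x^2 + 3*c*x + d = 0 (a != 0)."""
    a, b, c, d = (Fraction(v) for v in (a, b, c, d))
    G = a * a * d - 3 * a * b * c + 2 * b ** 3
    H = a * c - b * b
    return G * G + 4 * H ** 3 > 0

if __name__ == "__main__":
    print(has_unique_real_root(1, 0, 0, -1))   # x^3 - 1 = 0
    print(has_unique_real_root(1, 0, -1, 0))   # x^3 - 3x = 0
```

Note that the coefficients passed in are $b$ and $c$ themselves, not $3b$ and $3c$; a cubic given in the ordinary form must first have its middle coefficients divided by 3.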

6.10 Verify the limiting distribution asserted in (6.21). 

6.11 Let $X_1, \ldots, X_n$ be iid according to the Poisson distribution $P(\lambda)$. Find the ARE of
$\delta_{2n} = [\text{No. of } X_i = 0]/n$ to $\delta_{1n} = e^{-\bar X_n}$ as estimators of $e^{-\lambda}$.
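
As a rough numerical companion to Problem 6.11, the following Python sketch compares the two estimators by simulation and by the delta method. The function names, the sample sizes, and the use of Knuth's multiplication method for Poisson draws are illustrative assumptions and are not taken from the text. The ratio of the two asymptotic variances is read here as the ARE of $\delta_{2n}$ with respect to $\delta_{1n}$ in the sense of Definition 6.6.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def delta_method_variances(lam):
    """Asymptotic variances of sqrt(n) * (estimator - exp(-lam))."""
    v1 = lam * math.exp(-2.0 * lam)                  # exp(-sample mean), by the delta method
    v2 = math.exp(-lam) * (1.0 - math.exp(-lam))     # proportion of zeros, a binomial proportion
    return v1, v2


def simulated_variances(lam, n, reps, seed=0):
    """Return n times the sample variance of each estimator over `reps` samples."""
    rng = random.Random(seed)
    d1, d2 = [], []
    for _ in range(reps):
        xs = [poisson_draw(lam, rng) for _ in range(n)]
        d1.append(math.exp(-sum(xs) / n))
        d2.append(sum(1 for x in xs if x == 0) / n)

    def scaled_var(values):
        m = sum(values) / len(values)
        return n * sum((v - m) ** 2 for v in values) / (len(values) - 1)

    return scaled_var(d1), scaled_var(d2)


if __name__ == "__main__":
    v1, v2 = delta_method_variances(1.0)
    s1, s2 = simulated_variances(1.0, n=200, reps=2000, seed=1)
    print("delta-method variance ratio:", v1 / v2)
    print("simulated variance ratio:   ", s1 / s2)
```

The simulation is only a sanity check; the problem itself asks for the analytic derivation.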

6.12 Show that the efficiency (6.27) tends to 0 as $|a - \theta| \to \infty$.

6.13 For the situation of Example 6.9, consider as another family of distributions, the
contaminated normal mixture family suggested by Tukey (1960) as a model for observations
which usually follow a normal distribution but where occasionally something
goes wrong with the experiment or its recording, so that the resulting observation is a
gross error. Under the Tukey model, the distribution function takes the form

$$F_{\tau,\epsilon}(t) = (1-\epsilon)\Phi(t) + \epsilon\Phi(t/\tau).$$

That is, in the gross error cases, the observations are assumed to be normally distributed
with the same mean $\theta$ but a different (larger) variance $\tau^2$.⁹

(a) Show that if the $X_i$'s have distribution $F_{\tau,\epsilon}(x - \theta)$, the limiting distribution of $\delta_{2n}$
is unchanged.

(b) Show that the limiting distribution of $\delta_{1n}$ is normal with mean zero and variance

TU (1-e+er 2 )- 

(c) Compare the asymptotic relative efficiency of $\delta_{1n}$ and $\delta_{2n}$.

6.14 Let $X_1, \ldots, X_n$ be iid as $N(0, \sigma^2)$.

(a) Show that $\delta_n = k \sum |X_i| / n$ is a consistent estimator of $\sigma$ if and only if $k = \sqrt{\pi/2}$.

(b) Determine the ARE of $\delta_n$ with $k = \sqrt{\pi/2}$ with respect to the MLE $\sqrt{\sum X_i^2 / n}$.

6.15 Let $X_1, \ldots, X_n$ be iid with $E(X_i) = \theta$, $\mathrm{var}(X_i) = 1$, and $E(X_i - \theta)^4 = \mu_4$, and
consider the unbiased estimators $\delta_{1n} = (1/n)\sum X_i^2 - 1$ and $\delta_{2n} = \bar X^2 - 1/n$ of $\theta^2$.

(a) Determine the ARE $e_{2,1}$ of $\delta_{2n}$ with respect to $\delta_{1n}$.

(b) Show that $e_{2,1} \ge 1$ if the $X_i$ are symmetric about $\theta$.

(c) Find a distribution for the $X_i$ for which $e_{2,1} < 1$.

6.16 The property of asymptotic relative efficiency was defined (Definition 6.6) for estimators
that converged to normality at rate $\sqrt{n}$. This definition, and Theorem 6.7, can
be generalized to include other distributions and rates of convergence.

9 As has been pointed out by Stigler (1973), such models for heavy-tailed distributions had already
been proposed much earlier in a forgotten work by Newcomb (1882, 1886).

Theorem 9.1 Let $\{\delta_{in}\}$ be two sequences of estimators of $g(\theta)$ such that

$$n^{\alpha}[\delta_{in} - g(\theta)] \xrightarrow{\mathcal{L}} \tau_i T, \quad \alpha > 0, \ \tau_i > 0, \ i = 1, 2,$$

where the distribution $H$ of $T$ has support on an interval $-\infty \le A < B \le \infty$ with
strictly increasing cdf on $(A, B)$. Then, the ARE of $\{\delta_{2n}\}$ with respect to $\{\delta_{1n}\}$ exists and
is

$$e_{21} = \lim_{n_2 \to \infty} \frac{n_1(n_2)}{n_2} = \left[\frac{\tau_1}{\tau_2}\right]^{1/\alpha}.$$


6.17 In Example 6.10, show that the conditions of Theorem 5.1 are satisfied. 


Section 7 


7.1 Prove Theorem 7.1. 

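7.2 For the situation of Example 7.3 with $m = n$:

(a) Show that a necessary condition for (7.5) to converge to $N(0, 1)$ is that $\sqrt{n}(\hat\lambda - \lambda) \xrightarrow{P} 0$,
where $\hat\lambda = \hat\sigma^2/\hat\tau^2$ and $\lambda = \sigma^2/\tau^2$, for $\hat\sigma^2$ and $\hat\tau^2$ of (7.4).

(b) Use the fact that $\hat\lambda/\lambda$ has an $F$-distribution to show that $\sqrt{n}(\hat\lambda - \lambda) \not\to 0$.

(c) Show that the full MLE is given by the solution to

$$\hat\xi = \frac{(m/\hat\sigma^2)\bar X + (n/\hat\tau^2)\bar Y}{m/\hat\sigma^2 + n/\hat\tau^2}, \quad \hat\sigma^2 = \frac{1}{m}\sum (X_i - \hat\xi)^2, \quad \hat\tau^2 = \frac{1}{n}\sum (Y_j - \hat\xi)^2,$$

and deduce its asymptotic efficiency from Theorem 5.1.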

7.3 In Example 7.4, determine the joint distribution of (a) $(\hat\sigma^2, \hat\tau^2)$ and (b) $(\hat\sigma^2, \hat\sigma_A^2)$.

7.4 Consider samples $(X_1, Y_1), \ldots, (X_m, Y_m)$ and $(X'_1, Y'_1), \ldots, (X'_n, Y'_n)$ from two bivariate
normal distributions with means zero and variance-covariances $(\sigma^2, \tau^2, \rho\sigma\tau)$
and $(\sigma'^2, \tau'^2, \rho'\sigma'\tau')$, respectively. Use Theorem 7.1 and Examples 6.5 and 6.8 to find
the limit distribution

(a) of $\hat\sigma^2$ and $\hat\tau^2$ when it is known that $\rho' = \rho$;

(b) of $\hat\rho$ when it is known that $\sigma' = \sigma$ and $\tau' = \tau$.

7.5 In the preceding problem, find the efficiency gain (if any) 

(a) in part (a) resulting from the knowledge that $\rho' = \rho$;

(b) in part (b) resulting from the knowledge that $\sigma' = \sigma$ and $\tau' = \tau$.

7.6 Show that the likelihood equations (7.11) have at most one solution. 

7.7 In Example 7.6, suppose that $p_i = 1 - F(\alpha + \beta t_i)$ and that both $\log F(x)$ and $\log[1 - F(x)]$
are strictly concave. Then, the likelihood equations have at most one solution.

7.8 (a) If the cdf $F$ is symmetric and if $\log F(x)$ is strictly concave, so is $\log[1 - F(x)]$.

(b) Show that $\log F(x)$ is strictly concave when $F$ is strongly unimodal but not when $F$ is Cauchy.

7.9 In Example 7.7, show that $Y_n$ is less informative than $Y$.

[Hint: Let $Z_n$ be distributed as $P(\lambda \sum_{i=n+1}^{\infty} \gamma_i)$ independently of $Y_n$. Then, $Y_n + Z_n$ is a
sufficient statistic for $\lambda$ on the basis of $(Y_n, Z_n)$ and $Y_n + Z_n$ has the same distribution as
$Y$.]

7.10 Show that the estimator $\delta_n$ of Example 7.7 satisfies (7.14).

7.11 Find suitable normalizing constants for $\delta_n$ of Example 7.7 when (a) $\gamma_i = i$, (b)
$\gamma_i = i^2$, and (c) $\gamma_i = 1/i$.

7.12 Let $X_i$ ($i = 1, \ldots, n$) be independent normal with variance 1 and mean $\beta t_i$ (with $t_i$
known). Discuss the estimation of $\beta$ along the lines of Example 7.7.

7.13 Generalize the preceding problem to the situation in which (a) $E(X_i) = \alpha + \beta t_i$ and
$\mathrm{var}(X_i) = 1$ and (b) $E(X_i) = \alpha + \beta t_i$ and $\mathrm{var}(X_i) = \sigma^2$, where $\alpha$, $\beta$, and $\sigma^2$ are unknown
parameters to be estimated.

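7.14 Let $X_j$ ($j = 1, \ldots, n$) be independently distributed with densities $f_j(x_j|\theta)$ ($\theta$ real-valued),
let $I_j(\theta)$ be the information $X_j$ contains about $\theta$, and let $T_n(\theta) = \sum_{j=1}^{n} I_j(\theta)$ be
the total information about $\theta$ in the sample. Suppose that $\hat\theta_n$ is a consistent root of the
likelihood equation $L'(\theta) = 0$ and that, in generalization of (3.18)-(3.20),

$$\frac{1}{\sqrt{T_n(\theta_0)}}\, L'(\theta_0) \xrightarrow{\mathcal{L}} N(0, 1)$$

and

$$-\frac{L''(\theta_0)}{T_n(\theta_0)} \xrightarrow{P} 1 \quad \text{and} \quad \frac{L'''(\theta_n^*)}{T_n(\theta_0)} \ \text{is bounded in probability.}$$

Show that

$$\sqrt{T_n(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{L}} N(0, 1).$$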


7.15 Prove that the sequence $X_1, X_2, \ldots$ of Example 7.8 is stationary provided it satisfies
(7.17).

7.16 (a) In Example 7.8, show that the likelihood equation has a unique solution, that it
is the MLE, and that it has the same asymptotic distribution as $\delta'_n = \sum_{i=1}^{n} X_i X_{i+1} \big/ \sum_{i=1}^{n} X_i^2$.

(b) Show directly that $\delta'_n$ is a consistent estimator of $\beta$.

7.17 In Example 7.8: 

(a) Show that for $j > 1$ the expected value of the conditional information (given $X_{j-1}$)
that $X_j$ contains about $\beta$ is $1/(1 - \beta^2)$.

(b) Determine the information $X_1$ contains about $\beta$.

7.18 When $\tau = \sigma$ in (7.21), show that the MLE exists and is consistent.

7.19 Suppose that in (7.21), the $\xi$'s are themselves random variables, which are iid as
$N(\mu, \gamma^2)$.

(a) Show that the joint density of the $(X_i, Y_i)$ is that of a sample from a bivariate normal
distribution, and identify the parameters of that distribution.

(b) In the model of part (a), find asymptotically efficient estimators of the parameters
$\mu$, $\gamma$, $\beta$, $\sigma$, and $\tau$.

7.20 Verify the roots (7.22). 

7.21 Show that the likelihood (7.21) is unbounded. 

7.22 Show that if $\rho$ is defined by (7.24), then $\rho$ and $\rho'$ are everywhere continuous.

7.23 Let $F$ have a differentiable density $f$ and let $\int \psi^2 f < \infty$.

(a) Use integration by parts to write the denominator of (7.27) as $[\int \psi(x) f'(x)\,dx]^2$.

(b) Show that $\sigma^2(F, \psi) \ge [\int (f'/f)^2 f]^{-1} = I_f^{-1}$ by applying the Schwarz inequality
to part (a).


The following three problems will investigate the technical conditions required for the
consistency and asymptotic normality of M-estimators, as noted in (7.26).

7.24 To have consistency of M-estimators, a sufficient condition is that the root of the
estimating function be unique and isolated. Establish the following theorem.

Theorem 9.2 Assume that conditions (A0)-(A3) hold. Let $t_0$ be an isolated root of the
equation $E_{\theta_0}[\psi(X, t)] = 0$, where $\psi(\cdot, t)$ is monotone in $t$ and continuous in a neighborhood
of $t_0$. If $T_0(x)$ is a solution to $\sum_{i=1}^{n} \psi(x_i, t) = 0$, then $T_0$ converges to $t_0$ in
probability.

[Hint: The conditions on $\psi$ imply that $E_{\theta_0}[\psi(X, t)]$ is monotone, so $t_0$ is a unique root.
Adapt the proofs of Theorems 3.2 and 3.7 to complete this proof.]

7.25 Theorem 9.3 Under the conditions of Theorem 9.2, if, in addition

(i) $E_{\theta_0}\left[\frac{\partial}{\partial t}\psi(X, t)\big|_{t=t_0}\right]$ is finite and nonzero,

(ii) $E_{\theta_0}[\psi^2(X, t_0)] < \infty$,

then

$$\sqrt{n}(T_0 - t_0) \xrightarrow{\mathcal{L}} N(0, \sigma^2_{T_0}),$$

where $\sigma^2_{T_0} = E_{\theta_0}[\psi^2(X, t_0)] \big/ \left(E_{\theta_0}\left[\frac{\partial}{\partial t}\psi(X, t)\big|_{t=t_0}\right]\right)^2$.

[Note that this is a slight generalization of (7.27).]

[Hint: The assumptions on $\psi$ are enough to adapt the Taylor series argument of Theorem
3.10, where $\psi$ takes the place of $l'$.]

7.26 For each of the following estimates, write out the $\psi$ function that determines it, and
show that the estimator is consistent and asymptotically normal under the conditions of
Theorems 9.2 and 9.3.

(a) The least squares estimate, the minimizer of $\sum (x_i - t)^2$.

(b) The least absolute value estimate, the minimizer of $\sum |x_i - t|$.

(c) The Huber trimmed mean, the minimizer of (7.24).

7.27 In Example 7.12, compare (a) the asymptotic distributions of $\hat\xi$ and $\delta_n$; (b) the
normalized expected squared error of $\hat\xi$ and $\delta_n$.

7.28 In Example 7.12, show that (a) n(i — b) 4 N(0, b 2 ) and (b) nib — b) 4 
N( 0, b 2 ). 

7.29 In Example 7.13, show that 

(a) $\hat c$ and $\hat a$ are independent and have the stated distributions;

(b) $X_{(1)}$ and $\sum \log[X_i/X_{(1)}]$ are complete sufficient statistics on the basis of a sample
from (7.33).

7.30 In Example 7.13, determine the UMVU estimators of a and c, and the asymptotic 
distributions of these estimators. 

7.31 In the preceding problem, compare (a) the asymptotic distribution of the MLE 
and the UMVU estimator of c; (b) the normalized expected squared error of these two 
estimators. 

7.32 In Example 7.15, (a) verify equation (7.39), (b) show that the choice a = —2 
produces the estimator with the best second-order efficiency, (c) show that the limiting 
risk ratio of the MLE ($a = 0$) to $\delta_n$ ($a = -2$) is 2, and (d) discuss the behavior of this
estimator in small samples. 

7.33 Let $X_1, \ldots, X_n$ be iid according to the three-parameter lognormal distribution
(7.37). Show that

(a)

$$p^*(x|\xi) = \sup_{\gamma, \sigma^2} p(x|\xi, \gamma, \sigma^2) = \frac{c}{[\hat\sigma(\xi)]^n} \prod \frac{1}{x_i - \xi},$$

where

$$p(x|\xi, \gamma, \sigma^2) = \prod_{i=1}^{n} f(x_i|\xi, \gamma, \sigma^2),$$

$$\hat\sigma^2(\xi) = \frac{1}{n}\sum [\log(x_i - \xi) - \hat\gamma(\xi)]^2 \quad \text{and} \quad \hat\gamma(\xi) = \frac{1}{n}\sum \log(x_i - \xi).$$

(b) $p^*(x|\xi) \to \infty$ as $\xi \to x_{(1)}$.

[Hint: (b) For $\xi$ sufficiently near $x_{(1)}$,

$$\hat\sigma^2(\xi) \le \frac{1}{n}\sum [\log(x_i - \xi)]^2 \le [\log(x_{(1)} - \xi)]^2$$

and hence

$$p^*(x|\xi) \ge c\,|\log(x_{(1)} - \xi)|^{-n} \prod (x_{(i)} - \xi)^{-1}.$$

The right side tends to infinity as $\xi \to x_{(1)}$ (Hill 1963).]
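
The unboundedness in part (b) can be seen numerically. The sketch below evaluates the profile log likelihood of Problem 7.33(a), dropping the constant $\log c$, at $\xi = x_{(1)} - \varepsilon$ for shrinking $\varepsilon$. The function name, the toy data, and the parameterization by $\varepsilon$ (used so that very small gaps stay representable in floating point) are assumptions made for illustration.

```python
import math


def profile_loglik(xs, eps):
    """Profile log likelihood (up to an additive constant) at xi = min(xs) - eps."""
    n = len(xs)
    x_min = min(xs)
    # log(x_i - xi) is computed as log((x_i - x_min) + eps) so tiny eps is not rounded away
    logs = [math.log((x - x_min) + eps) for x in xs]
    gamma_hat = sum(logs) / n
    sigma2_hat = sum((v - gamma_hat) ** 2 for v in logs) / n
    # log of [sigma_hat(xi)]^(-n) * prod 1 / (x_i - xi)
    return -0.5 * n * math.log(sigma2_hat) - sum(logs)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.5, 5.0, 8.0]   # hypothetical sample
    for eps in (1e-2, 1e-50, 1e-200):
        print(eps, profile_loglik(data, eps))
```

The growth is very slow (of order $-\log\varepsilon - n\log|\log\varepsilon|$), which is why the gap has to become extremely small before the profile likelihood is visibly large.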

7.34 The derivatives of all orders of the density (7.37) tend to zero as $x \to \xi$.


Section 8 

8.1 Determine the limit distribution of the Bayes estimator corresponding to squared 
error loss, and verify that it is asymptotically efficient, in each of the following cases: 

(a) The observations $X_1, \ldots, X_n$ are iid $N(\theta, \sigma^2)$, with $\sigma$ known, and the estimand is
$\theta$. The prior distribution for $\Theta$ is a conjugate normal distribution, say $N(\mu, b^2)$.
(See Example 4.2.2.)

(b) The observations $Y_i$ have the gamma distribution $\Gamma(\gamma, 1/\tau)$, the estimand is $1/\tau$,
and $\tau$ has the conjugate prior density $\Gamma(g, \alpha)$.

(c) The observations and prior are as in Problem 4.1.9 and the estimand is $\lambda$.

(d) The observations $Y_i$ have the negative binomial distribution (4.3), $p$ has the prior
density $B(a, b)$, and the estimand is (a) $p$ and (b) $1/b$.


8.2 Referring to Example 8.1, consider, instead, the minimax estimator $\delta_n$ of $p$ given by
(1.11) which corresponds to the sequence of beta priors with $a = b = \sqrt{n}/2$. Then,

$$\sqrt{n}[\delta_n - p] = \sqrt{n}\left(\frac{X}{n} - p\right) + \frac{\sqrt{n}}{1 + \sqrt{n}}\left(\frac{1}{2} - \frac{X}{n}\right).$$

(a) Show that the limit distribution of $\sqrt{n}[\delta_n - p]$ is $N[\frac{1}{2} - p, p(1 - p)]$, so that $\delta_n$
has the same asymptotic variance as $X/n$, but that for $p \ne \frac{1}{2}$, it is asymptotically
biased.

(b) Show that ARE of $\delta_n$ relative to $X/n$ does not exist except in the case $p = \frac{1}{2}$ when
it is 1.
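
The asymptotic bias in part (a) can be checked with exact arithmetic rather than simulation. The sketch below assumes that (1.11) is the usual binomial minimax estimator $(X + \sqrt{n}/2)/(n + \sqrt{n})$, which is the posterior mean under the beta prior with $a = b = \sqrt{n}/2$. The function names are illustrative.

```python
import math


def minimax_estimate(x, n):
    """Posterior mean of p under a Beta(sqrt(n)/2, sqrt(n)/2) prior (assumed form of (1.11))."""
    root_n = math.sqrt(n)
    return (x + root_n / 2.0) / (n + root_n)


def scaled_bias(n, p):
    """sqrt(n) * (E[delta_n] - p); the estimator is linear in X and E[X] = n * p."""
    return math.sqrt(n) * (minimax_estimate(n * p, n) - p)


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**8):
        print(n, scaled_bias(n, 0.3))   # approaches 1/2 - p = 0.2
```

Because the scaled bias converges to $\frac{1}{2} - p$ instead of 0, the limit distribution is centered away from zero unless $p = \frac{1}{2}$.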


8.3 The assumptions of Theorem 2.6 imply (8.1) and (8.2). 

8.4 In Example 8.5, the posterior density of $\theta$ after one observation is $f(x_1 - \theta)$; it is a
proper density, and it satisfies (B5) provided $E_\theta|X_1| < \infty$.

8.5 Let $X_1, \ldots, X_n$ be independent, positive variables, each with density $(1/\tau)f(x_i/\tau)$,
and let $\tau$ have the improper density $\pi(\tau) = 1/\tau$ ($\tau > 0$). The posterior density after one
observation is a proper density, and it satisfies (B5), provided $E_\tau(1/X_1) < \infty$.

8.6 Give an example in which the posterior density is proper (with probability 1) after
two observations but not after one.

[Hint: In the preceding example, let $\pi(\tau) = 1/\tau^2$.]

8.7 Prove the result stated preceding Example 8.6.

8.8 Let $X_1, \ldots, X_n$ be iid as $N(\theta, 1)$ and consider the improper density $\pi(\theta) = e^{\theta^4}$. Then,
the posterior will be improper for all $n$.

8.9 Prove Lemma 8.7. 

8.10 (a) If $\sup|Y_n(t)| \xrightarrow{P} 0$ and $\sup|X_n(t) - c| \xrightarrow{P} 0$ as $n \to \infty$, then $\sup|X_n(t) - c\,e^{Y_n(t)}| \xrightarrow{P} 0$,
where the sup is taken over a common set $t \in T$.

(b) Use (a) to show that (8.22) and (8.23) imply (8.21). 

8.11 Show that (Bl) implies (a) (8.24) and (b) (8.26). 

10 Notes 

10.1 Origins 

The origins of the concept of maximum likelihood go back to the work of Lambert, 
Daniel Bernoulli, and Lagrange in the second half of the eighteenth century, and of 
Gauss and Laplace at the beginning of the nineteenth. (For details and references, see 
Edwards 1974 or Stigler 1986.) The modern history begins with Edgeworth (1908, 1909)
and Fisher (1922, 1925), whose contributions are discussed by Savage (1976) and Pratt 
(1976). 

Fisher’s work was followed by a euphoric belief in the universal consistency and asymptotic
efficiency of maximum likelihood estimators, at least in the iid case. The true situation
was sorted out only gradually. Landmarks are Cramér (1946a, 1946b), who shifted
the emphasis from the global to a local maximum and defined the “regular” case in which
the likelihood equation has a consistent asymptotically efficient root; Wald (1949), who 
provided fairly general conditions for consistency; the counterexamples of Hodges (Le 
Cam, 1953) and Bahadur (1958); and Le Cam’s resulting theorem on superefficiency 
(1953). 

Convergence (under suitable restrictions and appropriately normalized) of the posterior 
distribution of a real-valued parameter with a prior distribution to its normal limit was 
first discovered by Laplace (1820) and later reobtained by Bernstein (1917) and von 
Mises (1931). More general versions of this result are given in Le Cam (1958). The 
asymptotic efficiency of Bayes solutions was established by Le Cam (1958), Bickel 
and Yahav (1969), and Ibragimov and Has’minskii (1972). (See also Ibragimov and 
Has’minskii 1981.) 

Computation of likelihood estimators was influenced by the development of the EM 
Algorithm (Dempster, Laird, and Rubin 1977). This algorithm grew out of work done on 
iterative computational methods that were developed in the 1950s and 1960s, and can be 
traced back at least as far as Hartley (1958). The EM algorithm has enjoyed widespread 
use as a computational tool for obtaining likelihood estimators in complex problems 
(see Little and Rubin 1987, Tanner 1996, or McLachlan and Krishnan 1997). 

10.2 Alternative Conditions for Asymptotic Normality 

The Cramér conditions for asymptotic normality and efficiency that are given in Theorems
3.10 and 5.1 are not the most general; for those, see Strasser 1985, Pfanzagl 1985,
or LeCam 1986. They were chosen because they have fairly wide applicability, yet are
relatively straightforward to verify. In particular, it is possible to relax the assumptions
somewhat, and only require conditions on the second, rather than third, derivative (see 
Le Cam 1956, Hajek 1972, and Inagaki 1973). These conditions, however, are somewhat 
more involved to check than those of Theorem 3.10, which already require some effort. 
The conditions have also been altered to accommodate specific features of a problem. 
One particular change was introduced by Daniels (1961) to overcome the nondifferentiability
of the double exponential distribution (see Example 3.14). Huber (1967) notes an
error in Daniels' proof; however, the validity of the theorem remains. Others have taken
advantage of the form of the likelihood. Berk (1972b) exploited the fact that in exponential
families, the cumulant generating function is convex. This, in turn, implies that
the log likelihood is concave, which then leads to simpler conditions for consistency and 
asymptotic normality. Other proofs of existence and consistency under slightly different 
assumptions are given by Foutz (1977). Consistency proofs in more general settings 
were given by Wald (1949), Le Cam (1953), Bahadur (1967), Huber (1967), Perlman 
(1972), and Ibragimov and Has'minskii (1981), among others. See also Pfanzagl 1969, 
1994, Landers 1972, Pfaff 1982, Wong 1992, Bickel et al. 1993, and Note 10.4. Another
condition, which also eliminates the problem of superefficiency, is that of local asymptotic
normality (Le Cam 1986, Strasser 1985, Section 81, Le Cam and Yang 1990, and
Wong 1992).

10.3 Measurability and Consistency 

Theorems 3.7 and 4.3 assert the existence of a consistent sequence of roots of the 
likelihood equation, that is, a sequence of roots that converges in probability to the true 
parameter value. The proof of Theorem 3.7 is a modification of those of Cramer (1946a, 
1946b) and Wald (1949), where the latter established convergence almost everywhere of 
the sequence. In almost all cases, we are taught, convergence almost everywhere implies 
convergence in probability, but that is not so here because a sequence of roots need not be 
measurable! Happily, the $\theta^*$ of Theorem 3.7 are measurable (however, those of Theorem
4.3 are not necessarily). Serfling (1980, Section 4.2.2; see also Problem 3.29) addresses
this point, as does Ferguson (1996, Section 17), who also notes that nonmeasurability 
does not preclude consistency. (We thank Professor R. Wijsman for alerting us to these 
measurability issues.) 

10.4 Estimating Equations 

Theorems 9.2 and 9.3 use assumptions similar to the original assumptions of Huber 
(1964, 1981, Section 3.2). Alternate conditions for consistency and asymptotic normality,
which relax some smoothness requirements on $\rho$, have been developed by Boos
(1979) and Boos and Serfling (1980); see also Serfling 1980, Chapter 7, for a detailed 
development of this topic. Further results can be found in Portnoy (1977a, 1984, 1985) 
and the discrete case is considered by Simpson, Carroll, and Ruppert (1987). 

The theory of M-estimation, in particular results such as (7.26), has been generalized
in many ways. In doing so, much has been learned about the properties of the functions
$\rho$ and $\psi = \rho'$ needed for the solution $\hat\theta$ to the equation $\sum \psi(x_i - \theta) = 0$ to have
reasonable statistical properties.

For example, the structure of the exponential family can be exploited to yield less restrictive
conditions for consistency and asymptotic efficiency of $\hat\theta$. In particular, the
concavity of the log likelihood plays an important role. Haberman (1989) gives a comprehensive
treatment of consistency and asymptotic normality of estimators derived
from maximizing concave functions (which include likelihood and M-estimators).

This approach to constructing estimators has become known as the theory of estimating
functions [see, for example, Godambe 1991 or the review paper by Liang and Zeger
(1994)]. A general estimating equation has the form $\sum_{i=1}^{n} h(X_i|\theta) = 0$, and consistency
and asymptotic normality of the solution $\hat\theta$ can be established under quite general
conditions (but also see Freedman and Diaconis 1982 or Lele 1994 for situations
where this can go wrong). For example, if the estimating equation is unbiased, so that
$E_\theta[\sum_{i=1}^{n} h(X_i|\theta)] = 0$ for all $\theta$, then the “usual” regularity conditions (such as those
in Problems 7.24 - 7.25 or Theorem 3.10) will imply that $\hat\theta$ is consistent. Asymptotic
normality will also often follow, using a proof similar to that of Theorem 3.10, where
the estimating function $h$ is used instead of the log likelihood $l$. Carroll, Ruppert, and
Stefanski (1995, Appendix A.3) provide a nice introduction to this topic.
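
To make the estimating-equation idea concrete, here is a minimal Python sketch that solves $\sum \psi(x_i - \theta) = 0$ by bisection for a nondecreasing $\psi$ with $\psi(0) = 0$. The clipped-linear $\psi$ with cutoff `k` is the familiar Huber-type choice; the cutoff value, the function names, and the toy data are assumptions for illustration and are not taken from (7.24).

```python
def huber_psi(u, k):
    """Clipped-linear psi: equals u on [-k, k] and is constant outside."""
    return max(-k, min(k, u))


def m_estimate(xs, psi, tol=1e-10):
    """Solve sum(psi(x_i - theta)) = 0 by bisection.

    Assumes psi is nondecreasing with psi(0) = 0, so the sum is
    nonincreasing in theta and changes sign between min(xs) and max(xs).
    """
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(psi(x - mid) for x in xs) > 0:
            lo = mid          # estimating function still positive: root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 100.0]                      # one gross outlier
    print(m_estimate(data, lambda u: huber_psi(u, 1.5)))    # robust location estimate
    print(m_estimate(data, lambda u: u))                    # psi(u) = u gives the sample mean
```

With $\psi(u) = u$ the equation reduces to the least squares case of Problem 7.26(a); clipping $\psi$ bounds the influence of the outlier.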

10.5 Variants of Likelihood 

A large number of variants of the likelihood function have been proposed. Many started as 
a means of solving a particular problem and, as their usefulness and general effectiveness 
was realized, they were generalized. Although we cannot list all of these variants, we 
shall mention a few of them. 

The first modifications of the usual likelihood function are primarily aimed at dealing 
with nuisance parameters. These include the marginal, conditional, and profile likelihoods,
and the modified profile likelihood of Barndorff-Nielsen (1983). In addition, many
of the modifications are accompanied by higher-order distribution approximations that
result in faster convergence to the asymptotic distribution. These approximations may
utilize techniques of small-sample asymptotics (conditioning on ancillaries, saddlepoint
expansions) or possibly Bartlett corrections (Barndorff-Nielsen and Cox 1984).

Other modifications of likelihood may entail, perhaps, a more drastic variation of the 
likelihood function. The partial likelihood of Cox (1975; see also Oakes 1991) presents
an effective means of dealing with censored data, by dividing the model into parametric
and nonparametric parts. Along these lines quasi-likelihood (Wedderburn 1974, McCullagh
and Nelder 1989, McCullagh 1991) is based only on moment assumptions and
empirical likelihood (Owen 1988, 1990, Hall and La Scala 1990) is a nonparametric 
approach based on a multinomial profile likelihood. 

There are many other variations of likelihood, including directed, penalized, and extended,
and the idea of predictive likelihood (Hinkley 1979, Butler 1986, 1989).

An entry to this work can be obtained through Kalbfleisch (1986), Barndorff-Nielsen
and Cox (1994), or Edwards (1992), the review articles of Hinkley (1980) and Bjørnstad
(1990), or the volume of review articles edited by Hinkley, Reid, and Snell (1991). 

10.6 Boundary Values 

A key feature throughout this chapter was the assumption that the true parameter point $\theta_0$
occurs at an interior point of the parameter space (Section 6.3, Assumption A3; Section
6.5, Assumption A). The effect of this assumption is that, for large $n$, as the likelihood
estimator gets close to $\theta_0$, the likelihood estimator will, in fact, be a root of the likelihood.
(Recall the proofs of Theorems 3.7 and 3.10 to see how this is used.) However, in some
applications $\theta_0$ is on the boundary of the parameter space, and the ML estimator is not
a root of the likelihood. This situation is more frequently encountered in testing than
estimation, where the null hypothesis $H_0 : \theta = \theta_0$ often involves a boundary point.
However, boundary values can also occur in point estimation. For example, in a mixture 
problem (Example 6.10), the value of the mixing parameter could be the boundary value 
0 or 1. Chernoff (1954) first investigated the asymptotic distribution of the maximum
likelihood estimator when the parameter is on the boundary. This distribution is typically
not normal, and is characterized by Self and Liang (1987), who give many examples, 
ranging from multivariate normal to mixtures of chi-squared distributions to even more 
complicated forms. 

An alternate approach to establishing the limiting distribution is provided by Feng and 
McCulloch (1992), who use a strategy of expanding the parameter space. 
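
A small simulation illustrates why the limit is not normal at a boundary. The setting below is a deliberately simple assumed example, not one taken from the text: $X_1, \ldots, X_n$ iid $N(\theta, 1)$ with parameter space $\theta \ge 0$ and true value $\theta_0 = 0$, where the restricted MLE is $\max(\bar X, 0)$.

```python
import random


def boundary_mle_draws(n, reps, seed=0):
    """Simulate sqrt(n) * MLE for N(theta, 1) data with theta restricted to [0, inf), true theta = 0."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        theta_hat = max(xbar, 0.0)          # restricted MLE: sample mean projected onto [0, inf)
        draws.append(n ** 0.5 * theta_hat)
    return draws


if __name__ == "__main__":
    d = boundary_mle_draws(n=50, reps=4000, seed=1)
    at_zero = sum(1 for v in d if v == 0.0) / len(d)
    print("fraction of replications with MLE exactly at the boundary:", at_zero)
```

Roughly half of the replications put the estimator exactly on the boundary, so in this example the limit of $\sqrt{n}\,\hat\theta_n$ is an equal mixture of a point mass at 0 and a half-normal distribution.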

10.7 Asymptotics of REML 

The results of Cressie and Lahiri (1993) and Jiang (1996, 1997) show that when using 
restricted maximum likelihood estimation (REML; see Example 2.7 and the discussion 
after Example 5.3) instead of ML, efficiency need not be sacrificed, as the asymptotic 
covariance matrix of the REML estimates is the inverse information matrix from the 
reduced problem. More precisely, we can write the general linear mixed model (generalizing
the linear model of Section 3.4) as

(10.1) $Y = X\beta + Zu + \varepsilon$,

where $Y$ is the $N \times 1$ vector of observations, $X$ and $Z$ are $N \times p$ design matrices, $\beta$ is
the $p \times 1$ vector of fixed effects, $u \sim N(0, D)$ is the $p \times 1$ vector of random effects,
and $\varepsilon \sim N(0, R)$, independent of $u$. The variance components of $D$ and $R$ are usually
the targets of estimation. The likelihood function $L(\beta, u, D, R|y)$ is transformed to the
REML likelihood by marginalizing out the $\beta$ and $u$ effects, that is,

$$L(D, R|y) = \int\!\!\int L(\beta, u, D, R|y)\, du\, d\beta.$$

Suppose now that $V = V(\theta)$, that is, the vector $\theta$ represents the variance components
to be estimated. We can thus write $L(D, R|y) = L(V(\theta)|y)$ and denote the information
matrix of the marginal likelihood by $I_N(\theta)$. Cressie and Lahiri (1993, Corollary 3.1)
show that under suitable regularity conditions,

$$[I_N(\theta)]^{1/2}(\hat\theta_N - \theta) \xrightarrow{\mathcal{L}} N(0, I),$$

where $\hat\theta_N$ maximizes $L(V(\theta)|y)$. Thus, the REML estimator is asymptotically efficient.
Jiang (1996, 1997) has extended this result, and established the asymptotic normality of
$\hat\theta$ even when the underlying distributions are not normal.

10.8 Higher-Order Asymptotics

Typically, not only is the MLE asymptotically efficient, but so also are various approximations
to the MLE, to Bayes estimators, and so forth. Therefore, it becomes important
to be able to distinguish between different asymptotically efficient estimator sequences.
For example, it seems plausible that one would do best in any application of Theorem
4.3 by using a highly efficient $\sqrt{n}$-consistent starting sequence. It has been pointed out
earlier that an efficient estimator sequence can always be modified by terms of order $1/n$
without affecting the asymptotic efficiency. Thus, to distinguish among them requires 
taking into account the terms of the next order. 

A number of authors (among them Rao 1963, Pfanzagl 1973, Ghosh and Subramanyam
1974, Efron 1975, 1978, Akahira and Takeuchi 1981, and Bhattacharya and Denker 
1990) have investigated estimators that are “second-order efficient,” that is, efficient and 
among efficient estimators have the greatest accuracy to terms of the next order, and in 
particular these authors have tried to determine to what extent the MLE is second-order 
efficient. For example, Efron (1975, Section 10) shows that in exponential families, the 
MLE minimizes the coefficient of the second-order term among efficient estimators.

For the most part, however, the asymptotic theory presented here is “first-order” theory
in the sense that the conclusion of Theorem 3.10 can be expressed as saying that

$$\sqrt{n I(\theta)}\,(\hat\theta_n - \theta) = Z + O\!\left(\frac{1}{\sqrt{n}}\right),$$

where $Z$ is a standard normal random variable, so the convergence is at rate $O(1/n^{1/2})$.
It is possible to reduce the error in the approximation to $O(1/n^{3/2})$ using “higher-order”
asymptotics. The book by Barndorff-Nielsen and Cox (1994) provides a detailed
treatment of higher-order asymptotics. Other entries into this subject are through the 
review papers of Reid (1995, 1996) and a volume edited by Hinkley, Reid, and Snell 
(1991). 

Another technique that is very useful in obtaining accurate approximations for the densities
of statistics is the saddlepoint expansion (Daniels 1980, 1983), which can be derived
through inversion of a characteristic function or through the use of Edgeworth expansions.
Entries to this literature can be made through the review paper of Reid (1988), the
monograph by Kolassa (1993), or the books by Field and Ronchetti (1990) or Jensen 
(1995). 

Still another way to achieve higher-order accuracy in certain cases is through a technique 
known as the bootstrap, initiated by Efron (1979, 1982b). Some of the theoretical foundations
of the bootstrap are rooted in the work of von Mises (1936, 1947) and Kiefer and
Wolfowitz (1956). The bootstrap can be thought of as a “nonparametric” MLE, where the
quantity $\int h(x)\,dF(x)$ is estimated by $\int h(x)\,dF_n(x)$. Using the technique of Edgeworth
expansions, it was established by Singh (1981) (see also Bickel and Freedman 1981) 
that the bootstrap sometimes provides a more accurate approximation than the Delta 
Method (Theorem 1.8.12). An introduction to the asymptotic theory of the bootstrap 
is given by Lehmann (1999), and implementation and applications of the bootstrap are 
given in Efron and Tibshirani (1993). Other introductions to the bootstrap are through 
the volume edited by LePage and Billard (1992), the book by Shao and Tu (1995), or 
the review paper of Young (1994). A more theoretical treatment is given by Hall (1992). 
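
As a minimal illustration of the plug-in idea behind the bootstrap, the sketch below resamples from the empirical distribution $F_n$ to approximate the standard error of a statistic. The function name, the number of resamples, and the toy data are assumptions; the example uses the sample mean only because its plug-in standard error is known in closed form and can serve as a check.

```python
import math
import random
import statistics


def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Nonparametric bootstrap standard error of `stat`, resampling from the empirical distribution."""
    rng = random.Random(seed)
    n = len(data)
    # each replicate draws n points with replacement, i.e., an iid sample from F_n
    reps = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)


if __name__ == "__main__":
    data = [float(i) for i in range(1, 51)]               # hypothetical sample
    se_boot = bootstrap_se(data, statistics.mean, n_boot=2000, seed=1)
    se_plug_in = statistics.pstdev(data) / math.sqrt(len(data))
    print("bootstrap SE of the mean:", se_boot)
    print("plug-in SE of the mean:  ", se_plug_in)
```

The same function applies unchanged to statistics such as the median, for which no simple closed-form standard error is available.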
in, 450, 470

mean-value parameter for, 116, 
126 

minimal sufficient statistic for, 
39 

minimax estimation in, 322 
moments and cumulants for, 28, 
28 

natural parameter space of, 24 
prior distributions, 236 
relative to group families, 32 
unbiased estimation in, 88 
which is a location or scale 
family, 32, 41 

Exponential linear model, 193, 223 
Exponential location family, 32 
Exponential one- and two-sample 
problem, 98, 133, 153, 175, 208, 
485 

Exponential scale family, 32 

Factorial experiment, 184 
Factorization criterion, for
sufficient statistics, 35
Family of distributions, see Discrete
location, Dominated,
Exponential, Group, Location,
Location-scale, Nonparametric,
Scale, Two-sample location
Fatou's Lemma, 11
Finite group, 252, 338, 344
Finite population model, 22, 198,
224, 305, 427

Finite binomial sampling plan, 103
First order ancillary, 41
Fisher information, 115, 144, 424
additivity of, 119
total, 610. See also Information
matrix

Fixed effects, 187
Formal invariance, 161, 209, 223
Frame, in survey sampling, 198
Frequentist, 2, 421
Functional equivariance, 161, 209,
223

Fubini's theorem, 13, 78
Full exponential model, 79
Full linear group, 224
Full-rank exponential family, 24
Function, see Absolutely
continuous, Concave, Convex,
Digamma, Hypergeometric,
Incomplete Beta, Strongly
differentiable, Subharmonic,
Superharmonic, Trigamma,
Weakly differentiable
Gamma distribution, 25, 67 
conjugate of, 245 
as exponential scale family, 32 
Fisher information for, 117, 127
moments and cumulants of, 30 
as prior distribution, 236, 240, 
254, 257, 268, 277 
Gamma-minimax, 307, 389 
Gauss-Markov theorem, 184, 220. 

See also Least squares 
General linear group, 224, 422 


General linear model, see Normal 
linear model 

General linear mixed model, 518 
Generalized Bayes estimator, 239, 
284, 315, 383 

Generalized linear model, 197, 305 
Geometric distribution, 134 
Gibbs sampler, 256, 291, 305, 508 
GLIM, 198 
Gradient $\nabla$, 80
Group, 19, 159 
Abelian, 247 
amenable, 422 
commutative, 19, 165 
finite, 338 
full linear, 224 
general linear, 224, 422 
invariant measure over, 247 
location, 247, 250 
location-scale, 248, 250 
orthogonal, 348 
scale, 248, 250 
transformation, 19 
triangular, 65 

Group family, 16, 17, 32, 65, 68, 
163, 165 

Grouped observations, 455 

Haar measure, 247, 287, 422 
Hammersley-Chapman-Robbins 
inequality, 114 
Hardy-Weinberg model, 220 
Hazard function, 140, 144 
Hessian matrix, 49, 73 
Hidden Markov chain, 254 
Hierarchical Bayes, 230, 253, 260, 
268 

compared to empirical Bayes, 
264 

Higher order asymptotics, 518 
Horvitz-Thompson estimator, 222 
Huber estimator, 484 
Huber loss function, 52 
Hunt-Stein theorem, 421 
Hypergeometric function, 97
Hypergeometric distribution, 320
Hyperparameter, 227, 254 
Hyperprior, 230, choice of, 269 

Idempotent, 367 
Identity Transformation, 19 
i.i.d. (identically independently 
distributed), 4 
Improper prior, 238 
Inadmissibility, 48, 324 

of James-Stein estimator, 356, 
357, 377 

of minimax estimator, 327, 418 
of MRE estimator, 334 
of normal vector mean, 351, 352
of positive-part estimator, 377 
in presence of nuisance 
parameters, 334, 342 
of pre-test estimator, 351, 352 
of UMVU estimator, 99. See 
also Admissibility 
Incidental parameters, 481, 482 
Incomplete beta function, 219 
Independence, conditional, 108, 195 
Independent experiments, 195, 349, 
374. See also Simultaneous 
estimation 

Indicator (I_A) of a set A, 9
Inequality 

Bhattacharya, 128 
Chebyshev, 55, 75 
covariance, 113, 144, 370 
Cramer-Rao, 136, 143 
differential, 420 

Hammersley-Chapman-Robbins, 114

information, 113, 120, 123, 127, 
144, 325 

Jensen, 47, 52, 460
Kiefer, 140 
Schwarz, 74, 130 

Information, in hierarchy, 260. See 
also Fisher information 
Information bound, attainment of, 
121, 440 


Information inequality, 113, 120, 
144, 144 

attainment of, 121 
asymptotic version of, 439 
geometry of, 144 
multiparameter, 124, 127, 462 
in proving admissibility, 325, 
420. See also the following 
inequalities: Bhattacharya, 
Cramer-Rao, Hammersley-Chapman-Robbins,
Kiefer

Information matrix, 124, 462 
Integrable, 10, 16 
Integral, 9, 10 
continuity of, 27 
by Monte Carlo, 290 
Interaction, 184, 195 
Invariance: of estimation problem, 
160 

formal, 161 

of induced measure, 250 
of loss function, 148, 160 
of measure, 247 
nonexistence of, 166 
of prior distribution, 246, 338 
of probability model, 158, 159 
and sufficiency, 156 
and symmetry, 149. See also 
Equivariance, Haar measure 
Invariant distribution of a Markov 
chain, 290, 306 

Inverse binomial sampling, 101. See 
also Negative binomial 
distribution 
Inverse cdf, 73 

Inverse Gaussian distribution, 32, 68

Inverted gamma distribution, 245 
Irreducible Markov chain, 306 

Jackknife, 83, 129 
James-Stein estimator, 272, 351 
Bayes risk of, 274 
Bayes robustness of, 275 
as empirical Bayes estimator,
273, 295, 298 
component risk, 356 
inadmissibility of, 276, 282, 356
maximum component risk of, 353

risk function of, 355. See also
Positive part James-Stein
estimator, Shrinkage estimator,
Simultaneous estimation
Jeffrey’s prior, 230, 234, 287, 305, 
315. See also Reference prior 
Jensen’s inequality, 46, 47, 52 

Karlin’s theorem, 331, 389, 427 
Kiefer inequality, 140 
Kullback-Leibler information, 47, 
259, 293 

Labels in survey sampling, 201, 224 
random, 201 

Laplace approximation, 270, 297 
Laplacian (∇²f), 361
Large deviation, 81 
Least absolute deviations, 484 
Least favorable: distribution, 310, 
420 

sequence of distributions, 316 
Least informative distribution, 153 
Least squares, 3, 178 

Gauss’ theorem on, 184, 220 
Lebesgue measure, 8, 14 
Left invariant Haar measure, 247, 
248, 250, 287 

Likelihood: conditional, 517 
empirical, 517 
marginal, 517 
partial, 517 
penalized, 517 
profile, 517 
quasi, 517 

Likelihood equation, 447, 462 
consistent root of, 447, 463 
multiple roots of, 451 
Likelihood function, 238, 444, 517 


Lim inf, 11, 63 

Limit of Bayes estimators, 239, 383 
Lim sup, 11, 63 
Limiting Bayes method (for 
proving admissibility), 325 
Limiting moment approach, 429, 
430. See also Asymptotic 
distribution approach 
Linear estimator, admissibility of, 
323, 389 

properties of, 184 
Linear minimax risk, 329 
Linear model, 176 

admissible estimators in, 329 
Bayes estimation in, 305 
canonical form for, 177 
full-rank model, 180 
generalization of, 220 
least squares estimators in, 178, 
180, 182, 184 

minimax estimation in, 392 
MRE estimation in, 178 
without normality, 184 
UMVU estimation in, 178. See 
also Normal linear model 
Link function, 197 
Lipschitz condition, 123 
LMVU, see Locally minimum 
variance unbiased estimator 
Local asymptotic normality (LAN), 
516 

Locally minimum variance 
unbiased estimator, 84, 90, 113 
Location/curved exponential family, 
41 

Location family, 17, 340, 492 
ancillary statistics for, 41 
asymptotically efficient 
estimation in, 455, 492 
circular, 339 
discrete, 344 
exponential, 32 
information in, 118 
invariance in, 158 
minimal sufficient statistics for, 
38 
minimax estimation in, 340
MRE estimator in, 150 
two-sample, 159 
which is a curved exponential 
family, 41 

which is an exponential family,
32. See also Location-scale
family, Scale family
Location group, 247, 250 
Location invariance, 149, 149 
Location parameter, 148, 223 
Location-scale family, 17, 167 
efficient estimation in, 468 
information in, 126 
invariance in, 167 
invariant loss function for, 173 
MRE estimator for, 174 
Location-scale group, 248, 250 
Log likelihood, 444 
Log linear model, 194 
Logarithmic series distribution, 67 
Logistic distribution L(a, b), 18,
196, 479 

Fisher information in, 119, 139
minimal sufficient statistics for, 
38 

Logistic regression model, 479 
Logit, 26, 196 

Logit dose-response model, 44 
Log-likelihood, 447 
Loglinear model, 194 
Lognormal distribution, 486 
Loss function, 4, 7 
absolute error, 50 
bounded, 51 
choice of, 7 
convex, 7, 45, 87, 152 
estimation of, 423 
family of, 354, 400 
invariant, 148 
multiple, 354 
non-convex, 51 
realism of, 51 
squared error, 50 
subharmonic, 53 
Lower semicontinuous, 74 


Markov chain, 55, 290, 306, 420 
Markov chain Monte Carlo 
(MCMC), 256 

Markov series, normal autoregressive,
481 

Maximum component risk, 353, 
363, 364 

Maximum likelihood estimator 
(MLE), 98, 444, 467, 515
asymptotic efficiency of, 449, 
463, 482 

asymptotic normality of, 449, 
463 

bias corrected, 436 
of boundary values, 517 
comparison with Bayes 
estimator, 493 
comparison with UMVU 
estimator, 98 

in empirical Bayes estimation, 
265 

inconsistent, 445, 452, 482 
in irregular cases, 485 
measurability of, 448 
in the regular case, 515 
restricted (REML), 191, 390, 518

second order properties of, 518.
See also Efficient likelihood
estimation, Superefficiency
Mean (population), 200, 204, 319 
nonparametric estimation of,
110, 318. See also Normal
mean, Common mean,
One-sample problem
Mean (sample), admissibility of, 
324 

consistency of, 55 
distribution in Cauchy case, 3, 62 
inadmissibility of, 327, 350, 352 
inconsistency of, 76 
optimum properties of, 3, 98, 
110, 153, 200, 317, 318
Mean-unbiasedness, 5, 157. See 
also Unbiasedness 
Mean-value parametrization of 
exponential family, 116, 126
Measurable: function, 9, 16 
set, 8, 15 

Measurable transformation, 63, 64 
Measure, 8 
Measure space, 8 
Measure theory, 7 
Measurement error models, 483 
Measurement invariance, 223 
Measurement problem, 2 
Median, 5, 62, 455 

as Bayes estimator, 228. See also 
Scale median, 212 
Median-unbiasedness, 5, 157 
M-estimator, 484, 512, 513, 516 
Method of moments, 456 
Mill’s ratio, 140 
Minimal complete class, 378 
Minimal sufficient statistic, 37, 69, 
78 

and completeness, 42, 43 
dimensionality of, 40, 79 
Minimax estimator, 6, 225, 309, 425 
characterization of, 311, 316, 318

and equivariance, 421 
non-uniqueness, 327 
randomized, 313 
vector-valued, 349 
with constant risk, 336 
Minimax robustness, 426 
Minimum χ², 479
Minimum norm quadratic unbiased 
estimation (Minque), 192 
Minimum risk equivariant (MRE) 
estimator, 150, 162 
behavior under transformations, 
210 

comparison with UMVU 
estimator, 156 
inadmissible, 342 
in linear models, 178, 185 
in location families, 154 
in location-scale families, 171 
minimaxity and admissibility of, 
338, 342, 345 


non-unique, 164, 170 
risk unbiasedness of, 157, 165 
in scale families, 169 
under transitive group, 162 
unbiasedness of, 157 
which is not minimax, 343. See 
also Pitman estimator 
Minimum variance unbiased 
estimate, see Uniformly 
minimum variance unbiased 
estimator 
Minque, 192 
Missing data, 458 

Mixed effects model, 187, 192, 305, 
478 

Mixtures, 456 
normal, 474 

MLE, see Maximum likelihood 
estimator 

Model, see Exponential, Finite
population, Fixed effects,
General linear, Generalized
linear, Hierarchical Bayes,
Linear, Mixed effects,
Probability, Random, Threshold,
Tukey

Moment generating function, 28 
of exponential family, 28 
Monotone decision problem, 414 
Monte Carlo integration, 290 
Morphometrics, 213
MRE estimator, see Minimum risk 
equivariant estimator 
Multicollinearity, 424 
Multinomial distribution
M(p_0, ..., p_s; n), 24, 27, 220
Bayes estimation in, 349 
for contingency tables, 106, 193, 
197 

maximum likelihood estimation 
in, 194, 475, 479 
minimax estimation in, 349 
restricted, 194 

unbiased estimation in, 106, 194, 
197 

Multiple correlation coefficient, 96 
Multiple imputation, 292
Multi-sample problem, efficient 
estimation in, 475 
Multivariate CLT, 61 
Multivariate normal distribution, 
20, 61, 65, 96

information matrix for, 127 
maximum likelihood estimation 
in, 471. See also Bivariate 
normal distribution 
Multivariate normal one-sample 
problem, 96 

Natural parameter space (of 
exponential family), 24 
√n-consistent estimator, 454, 467
Negative binomial distribution 
Nb(p, n), 25, 66, 101, 375, 381
Negative hypergeometric 
distribution, 300 
Neighborhood model, 6 
Nested design, 190 
Newton-Raphson method, 453 
Non-central χ² distribution, 406
Nonconvex loss, 51. See also 
Bounded loss 

Noninformative prior, 230, 305. See
also Jeffrey’s prior, Reference
prior

Nonparametric density estimation, 
110, 144 

Nonparametric family, 21, 79 
complete sufficient statistic for, 
109 

unbiased estimation in, 109, 110 
Nonparametric: mean, 318 
model, 6 

one-sample problem, 110 
two-sample problem, 112 
Normal cdf, estimation of, 93 
Normal correlation coefficient, 
efficient estimation in, 472, 509 
multiple, 96 

unbiased estimation of, 96 
Normal distribution, 18, 324 


curved, 25 

empirical Bayes estimation in, 
263, 266

equivariant estimation in, 153 
as exponential family, 24, 25, 27, 
30, 32 

hierarchy, 254, 255 
as least informative, 153 
as limit distribution, 59, 442 
moments of, 30 
as prior distribution, 233, 242, 
254, 255, 258, 272 
sufficient statistics for, 36, 36, 38 
truncated, 393. See also 
Bivariate and Multivariate 
normal distribution 
Normal limit distribution, 58 
of binomial, 59 

Normal linear model, 21, 176, 177, 
329 

canonical form of, 177 
Normal mean, estimation of 
squared, 434 

Normal mean (multivariate), 20 
admissibility of, 426 
bounded, 425 

equivariant estimation of, 348 
minimax estimation of, 317. See
also James-Stein estimator,
Shrinkage estimation
Normal mean (univariate): 
admissibility of, 324 
Bayes estimator of, 234 
equivariant estimator of, 153, 174

minimax estimator of, 317 
restricted Bayes estimator of, 321

restricted to integer values, 140 
truncated, 327 

unbiased estimation in, 350, 352, 
352 

Normal: mixtures, 474 
one-sample problem, 91 
probability, 93 
probability density, 94, 97
two-sample problem 
Normal variance: admissibility of 
estimators, 330, 334 
Bayes estimation of, 236, 237 
estimation in presence of
incidental parameters, 482, 483

inadmissibility of standard 
estimator of, 334 
linear estimator of, 330 
MRE estimator of, 170, 170, 172 
UMVU estimator of, 92 
Normal vector mean, see Normal 
mean (multivariate) 

Normalizing constant, 57 
Nuisance parameters, 461 
effect on admissibility, 342 
effect on efficiency, 469. See 
also Incidental parameters 
Null-set, 14 

One-sample problem, see
Exponential one- and two-sample
problem, Nonparametric
one-sample problem, Normal
one-sample problem, Uniform
distribution

One-way layout, 176, 410 
EM algorithm, 458
empirical Bayes estimation for, 
278 

loss function for, 360 
random effects model for, 187, 
237, 477 
unbalanced, 181 
Optimal procedure, 2 
Orbit (of a transformation group), 
163 

Order notation (o, O, o_p, O_p), 77
Order statistics, 36 
sufficiency of, 36 
completeness of, 72, 109, 199 
Orthogonal: group, 348 
parameters, 469 


transformations, 177 
Orthogonal polynomials, 216 

Parameter, 1 

boundary values of, 517 
in exponential families, 245 
incidental, 482, 483 
orthogonal, 469 

structural, 481. See also Location
parameter, Scale parameter
Parameter invariance, 223 
Pareto distribution, 68, 486 
Partitioned matrix, 142 
Past experience, Bayes approach to, 
226 

Periodic Markov chain, 306 
Pitman estimator, 154, 155 
admissibility of, 156, 342 
asymptotic efficiency of, 492 
as Bayes estimator, 250, 252, 397

minimaxity of, 340 
Point estimation, 2 
Poisson distribution, 25, 30, 35, 
121, 427

admissibility of estimators, 427 
Bayes and empirical Bayes 
estimation in, 277 
Fisher information for, 118 
hierarchy, 257, 268, 277 
minimax estimation in, 336, 372 
misbehaving UMVU estimator, 
108 

moments and cumulants for, 30 
not a group family, 65 
Stein effect in, 372, 374 
sufficient statistics for, 33, 35 
truncated, 106 
unbiased estimation in, 105 
Poisson process, 106 
Population variance, 200 
Positive part of a function, 9 
Positive part James-Stein estimator, 
276, 356 

as empirical Bayes estimator, 282
inadmissible, 357, 377
as truncated Bayes estimator, 413

Posterior distribution, 227, 240 
convergence to normality, 489, 
514 

for improper prior, 340, 492, 515 
Power series distribution, 67, 104 
not a group family, 166 
Prediction, 192, 220 
Pre-test estimator, 351, 352 
Prior distribution, 227 
choice of, 227, 492 
conjugate, 236, 305 
improper, 232, 238 
invariant, 246 

Jeffrey’s, 230, 234, 287, 305, 315 
least favorable, 310 
noninformative, 230, 305 
reference, 261 

Probability P(A) of a set A, 14
second order inclusion 
Probability density, 14 

nonexistence of nonparametric 
unbiased estimator for, 109 
Probability distribution, 14 
absolutely continuous, 14 
discrete, 14 
estimation of, 109 
Probability measure, 14 
Probability model, 3, 6 
Probit, 196, 506 
Product measure, 13 
Projection, 367 

Proportional allocation, 204, 222 
Pseudo-Bayes estimator, 405 

Quadratic estimator of variance, 186, 192

Radius of curvature, 81 
Radon-Nikodym derivative, 12 
Radon-Nikodym theorem, 12 


Random effects model, 187, 278, 
323, 477 

additive, 187, 189 

Bayes model for, 237 

for balanced two-way layout, 478

nested, 190 
prediction in, 192 
UMVU estimators in, 189, 191. 
See also Variance components 
Random observable, 4 
Random variable, 15 
Random vector, 15 
Random linear equations, 465 
Random walk, 102, 343, 398 
Randomized estimator, 33, 48 
in complete class, 378 
in equivariant estimation, 155, 
156, 162 

in minimax estimation, 313 
in unbiased estimation, 131 
Randomized response, 322, 501 
Rao-Blackwell theorem, 47, 347 
Ratio of variances, unbiased 
estimation of, 95 
Rational invariance, 223 
Recentered confidence sets, 423 
Recurrent Markov chain, 306 
Reference prior, 261 
Regression, 176, 180, 181, 280, 420 
with both variables subject to
error, 482, 512. See also
Simple linear regression,
Ridge regression
Regular case for maximum 
likelihood estimation, 485 
Relevance of past experience, 230 
Relevant subsets, 391 
Reliability, 93 

REML (restricted maximum 
likelihood) estimator, 390, 518 
Rényi’s entropy function, 293
Residuals, 3 

Restricted Bayes estimator, 321, 426

Ridge regression, 425 
Riemann integral, 10
Right invariant Haar measure, 247, 
248, 249, 250, 253, 287 
Risk function, 5 

conditions for constancy, 162, 
162 

continuity of, 379 
invariance of, 162 
Risk unbiasedness, 157, 171, 223 
of MRE estimators, 165 
Robust Bayes, 230, 271, 307, 371 
Robustness, 52, 483 

Saddle point expansion, 519 
Sample cdf, see Empirical cdf 
Sample space, 15 

Sample variance, consistency of, 55 
Scale family, 17, 32 
Scale group, 163, 248, 250 
Scale median, 212 
Scale parameter, 167, 223 
Schwarz inequality, 74, 130 
Second order efficiency, 487, 494, 
518 

Second order inclusion 
probabilities, 222 

Sequential binomial sampling, 102, 
233 

Shannon information, 261 
Shrinkage estimator, 354, 366, 424 
factor, 351 
target, 366, 406, 424 
Sigma-additivity, 7 
Sigma field (σ-field), 8
Simple binomial sampling plan, 103 
Simple function, 9 
Simple linear regression, 180 
Simple random sampling, 198 
Bayes estimation for, 319 
equivariant estimation in, 200 
minimax estimation in, 319 
Single prior Bayes, 239 
Simultaneous estimation, 346, 354 
admissibility in, 350, 418 
equivariant estimation in, 348 


minimax estimation in, 317. See 
also Independent experiments, 
Stein effect 

Singular problem, 110, 144 
Size of population, estimation of, 
101 

Spherically symmetric, 359 
Spurious correlation, 107 
Square root of a positive definite 
matrix, 403 
Squared error, 7 

loss, 50, 51, 51, 90, 313
Standard deviation, 112 
Stationary distribution of a Markov 
chain, see Invariant distribution 
Stationary sequence, 306 
Statistic, 16 

Stein effect, 366, 372, 419 
absence of, 376, 388, 419 
Stein estimation, see Shrinkage 
estimation 

Stein’s identity, 31, 67, 285 
Stein’s loss function, 171, 214 
Stirling number of the 2nd kind, 136 
Stochastic processes, maximum 
likelihood estimation in, 481 
Stopping rule, 233 
Stratified cluster sampling, 206 
Stratified sampling, 22, 203, 222 
Strict convexity, 45, 49 
Strong differentiability, 141, 145 
Strongly unimodal, 502 
Structural parameter, 481 
Student’s t-distribution, Fisher
information for, 138 
Subgroup, 213, 224 
Subharmonic function, 53, 74 
loss, 53 

Subjective Bayesian approach, 227, 
305 

Subminimax, 312 
Sufficient statistics, 32, 47, 78, 347 
and Bayes estimation, 238 
completeness of, 42, 72 
dimensionality of, 40 
factorization criterion for, 35 
minimal, 37, 69, 78
operational significance of, 33 
for a symmetric distribution, 34. 
See also Minimal sufficient 
statistic 

Superefficiency, 440, 515, 515 
Superharmonic function, 53, 74, 
360, 362, 406, 426 
Support of a distribution, 16, 64 
Supporting hyperplane theorem, 52 
Survey sampling, 22, 224 
Symmetric distributions, 22, 50 
sufficient statistics for, 34 
Symmetry, 147. See also Invariance 
Systematic error, 5, 143 
Systematic sampling, 204 

Tail behavior, 51 
Tail minimax, 386 
Tightness (of a family of measures), 
381 

Threshold model, 197 
Tonelli’s theorem, see Fubini’s 
theorem 

Total information, 479 
Total positivity, 394 
Transformation group, 19 
transitive, 162 

Transitive transformation group, 162

Translation group, see Location 
group 

Triangular matrix, 20, 65 
Trigamma function, 126, 127 
Truncated distributions, 68, 72 
normal mean, 327 
efficient estimation in, 451 
Tschuprow-Neyman allocation, 204
Tukey model, 474, 510 
Two-sample location family, 159, 
162 

Two-way contingency table, 107, 
194 

Two-way layout, 183, 506 
random effects, 189, 192, 478


U-estimable, 83, 87
UMVU, see Uniformly minimum 
variance unbiased 
Unbiased in the limit, 431 
Unbiasedness, 5, 83, 143, 284 
in vector-valued case, 347 
Unidentifiable, 24, 56 
Uniform distribution, 18, 34, 36, 70, 
73 

Bayes estimation in, 240 
complete sufficient statistics for, 
42, 42, 70 

maximum likelihood estimation, 
485 

minimal sufficient statistics for, 
38 

MRE estimation for, 154, 172, 
174 

in the plane, 71 
relation to exponential 
distribution, 71 
UMV estimation in, 89 
Uniformly best estimator, 
nonexistence of, 5 
Uniformly minimum variance
unbiased (UMVU) estimation, 85, 143

comparison with MLE, 98, 99 
comparison with MRE 
estimator, 156 
in contingency tables, 194 
example of pathological case, 108

in normal linear models, 178 
in random effects model, 189, 
190 

in restricted multinomial 
models, 194 

in sampling from a finite 
population, 200, 203, 206 
of vector parameters, 348 
Unimodal density, 51, 153. See also 
Strongly unimodal 
Unimodular group, 247 
Universal Bayes estimator, 284 
U-shaped, 51, 153, 232
U-statistics, 111

Variance, estimator of, 98, 99, 110, 
110 

in linear models, 178, 184 
nonexistence of unbiased 
estimator of, 132 
nonparametric estimator of, 110 
quadratic unbiased estimator of, 
186 

in simple random sampling, 200 
in stratified random sampling,
204. See also Normal variance,
Variance components


Variance/bias tradeoff, 425 
Variance components, 189, 189, 
237, 323, 477, 478 
negative, 191 

Variance stabilizing transformation, 
76 

Variation reducing, 394 
Vector-valued estimation, 348 

Weak convergence, 57, 60 
Weak differentiability, 141, 145 
Weibull distribution, 65, 468, 487 